ABSTRACT


Title of Dissertation: Efficient Parallel Algorithms on The Network Model
Kwan Woo Ryu, Doctor of Philosophy, 1990

Dissertation directed by: Joseph F. JaJa, Professor, Electrical Engineering


We  develop  efficient  parallel  algorithms  for  several  fundamental  problems  on 
the  hypercube,  the  shuffle-exchange,  the  cube-connected  cycles  and  the  butterfly. 
Those  problems  are  related  to  load  balancing,  packet  routing,  list  ranking,  graph 
theory  and  VLSI  routing. 

Load balancing, sorting and packet routing problems have been studied heavily
on various parallel models, and optimal algorithms for these problems are
known on a few networks. We introduce a new simple and efficient algorithm
for load balancing on our networks, and show that load balancing requires more
time on our bounded-degree networks than on the weak hypercube. We also
show that sorting $n$ integers, each of which is bounded by a polynomial in
$n$, can be done in $O(\frac{n}{p})$ time on the pipelined hypercube, whenever
$n = \Omega(p^{1+\epsilon})$ for some fixed $\epsilon > 0$. Using these
results, we provide an efficient algorithm for packet routing on several
networks.


An algorithm will be called almost uniformly optimal if it is provably optimal
whenever $p \le n/\log^{k} n$ for some fixed constant $k$. We present almost
uniformly optimal algorithms to solve several problems such as the all nearest
smaller values (ANSV) problem and the line packing problem on our networks.

List ranking is a basic problem whose efficient solution can be used in many
graph algorithms. We describe an algorithm to solve the list ranking problem on
the pipelined hypercube in time $O(\frac{n}{p})$ when
$n = \Omega(p^{1+\epsilon})$, and in time
$O(\frac{n \log n}{p} + \log^{3} p)$ otherwise. This clearly attains a linear
speed-up when $n = \Omega(p^{1+\epsilon})$. We use this algorithm to obtain
efficient algorithms for many basic graph problems such as tree expression
evaluation, connected and biconnected components, ear decomposition and
st-numbering on the networks.


Finally, we develop parallel algorithms for several one-layer routing problems.
It is shown that the detailed routing and the routability testing problems within
a rectangle can each be solved in time $O(\frac{n}{p})$ on the pipelined
hypercube when $n = \Omega(p^{1+\epsilon})$. These problems are also addressed
in the other network models.
Chapter 1


Introduction 


In recent years, we have seen a tremendous surge in the availability of very fast
and inexpensive hardware. This has been made possible partly by the use of
faster circuit technologies and smaller feature sizes; partly by novel
architectural features such as pipelining, vector processing, cache memories,
and systolic arrays; and partly by novel interconnections between processors
and memories, such as the hypercube, the omega network, the cube-connected
cycles, and others. Such hardware technologies have made it possible to design
parallel computers - computers consisting of a number of processors dedicated
to solving a single problem at a time - by putting thousands of processors
together. In fact, parallel computers with thousands of processors, such as the
Connection Machine from Thinking Machines Inc. and the iPSC series from Intel
Corp., are already available commercially. However, the tremendous computing
power that has become available can be realized only if we design efficient
parallel algorithms that run on these parallel computers. In this thesis, we
develop parallel algorithms that can be efficiently implemented on models that
are abstractions of such parallel computers.

The rest of the chapter is organized as follows. The outline, the main
contributions, and the summary of results of the thesis are described in
Section 1.1. The computational models - the PRAM model and the network model -
are reviewed briefly in Sections 1.2 and 1.3, respectively. The last section
reviews several basic hypercube algorithms and routing schemes.

1.1  Outline 

Let $T_1(n)$ be the running time of the optimal sequential algorithm* for solving
a problem $\Pi$, where $n$ is the length of the input of $\Pi$. Let $T_p(n)$ be the running

*An optimal sequential algorithm does not necessarily exist. See [i;!] for a discussion of
the issue. All of the problems discussed in this thesis have optimal sequential algorithms, so
this difficulty does not arise.
time of a parallel algorithm for solving $\Pi$ with $p$ processors. Then the
speed-up, $S_p(n)$, of the parallel algorithm for solving $\Pi$ is
$\frac{T_1(n)}{T_p(n)}$. It measures how many times faster a parallel algorithm
is than a sequential one. Clearly, $1 \le S_p(n) \le p$. The efficiency,
$E_p(n)$, of the parallel algorithm is
$\frac{S_p(n)}{p} = \frac{T_1(n)}{p\,T_p(n)}$. It measures how effective
each processor in a parallel algorithm is relative to a sequential algorithm. It
normalizes the speed-up, so $0 < E_p(n) \le 1$. A parallel algorithm that runs
in  time  Tp{n)  is  said  to  be  efficient  if  Ep{n)  =  0(1),  and  is  said  to  be  almost 
efficient  if  Ep{n)  =  A  primary  goal  in  parallel  computation  is  to  design 

efficient or almost efficient algorithms that also run as fast as possible [43].
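
As a concrete illustration of these two measures, the following Python sketch
computes the speed-up and the efficiency under the standard definitions
$S_p(n) = T_1(n)/T_p(n)$ and $E_p(n) = S_p(n)/p$. The function names and the
sample running times are illustrative assumptions and do not come from the
thesis.

```python
def speedup(t1: float, tp: float) -> float:
    """Speed-up S_p(n) = T_1(n) / T_p(n)."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency E_p(n) = S_p(n) / p = T_1(n) / (p * T_p(n))."""
    return speedup(t1, tp) / p


# Hypothetical running times: a sequential algorithm taking 1024 steps and
# a 16-processor parallel algorithm taking 128 steps.
s = speedup(1024, 128)
e = efficiency(1024, 128, 16)
```

With these sample values the parallel algorithm is 8 times faster while each
processor is used at half of its potential.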

The goal of this thesis can be described as follows: Given a network
$\mathcal{N}$ with $p$ processors and a problem $\Pi$, find an efficient
algorithm to solve $\Pi$ on $\mathcal{N}$ for all problem sizes $n \ge p$.
Since this goal is difficult to achieve, we will often be satisfied with
finding an efficient algorithm when $n \ge p(\log p)^{O(1)}$, or even when
$n = \Omega(p^{1+\epsilon})$ for some fixed $\epsilon > 0$. In this thesis, we
show several results related to load balancing, sorting, packet routing, list
ranking, graph theory, and VLSI routing on the pipelined hypercube, the weak
hypercube, the shuffle-exchange, the cube-connected cycles, and the butterfly.
These results include the following.

•  Development  of  provably  efficient  algorithms  on  the  network  models. 

•  Establishment  of  lower  bounds  on  weak  hypercube  and  bounded-degree 

networks.  All  the  problems  considered  are  shown  to  require  Q{  time 

on  bounded-degree  networks. 

These  results  shed  some  light  on  the  relative  powers  of  the  pipelined  hypercube, 
the  weak  hypercube,  and  the  bounded-degree  networks. 

We now outline the thesis, and consider the main contributions one by one.
In Chapter 2, we show several results on load balancing and sorting, and
relate them to the general packet routing problem.

Balancing load among processors is very important since poor load balance
generally causes poor processor utilization. The load balancing problem is
defined as follows. Let $n$ items be distributed over the $p$ processors of a
network, with no more than $M$ items assigned to any single processor
($\lceil n/p \rceil \le M \le n$). The problem is to redistribute the items so
that the numbers of items in any two processors differ by at most one.
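
The target of load balancing can be stated in a few lines of code. The sketch
below is a sequential illustration of the problem statement only, not of the
network algorithm developed in the thesis; the function names and the sample
distribution are assumptions made for the example.

```python
def balanced_targets(loads):
    """Return per-processor item counts that differ by at most one."""
    n, p = sum(loads), len(loads)
    base, extra = divmod(n, p)
    # The first `extra` processors hold one additional item.
    return [base + 1 if i < extra else base for i in range(p)]


def is_balanced(loads):
    """Check the load balancing condition: counts differ by at most one."""
    return max(loads) - min(loads) <= 1


loads = [7, 0, 2, 1]   # n = 10 items on p = 4 processors, so M = 7
targets = balanced_targets(loads)
```

Here the initial distribution violates the condition, while the computed
targets hold the same 10 items with at most $\lceil n/p \rceil = 3$ items per
processor.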

Kruskal  et  al.  studied  load  balancing  (to  solve  the  list  ranking  problem) 
on  the  complete  network  [42].  Peleg  and  Upfal  developed  an  algorithm  for  this 
problem  whose  time  complexity  is  0{M  -f  logp  •  min(log  log  log p))  on  the 
bounded-degree  network  based  on  expander  graphs  [58].  Plaxton  developed  a 
weak  hypercube  algorithm  whose  time  complexity  is  0{M \/log  p  -f  log^  p)  [61]. 

We present an algorithm for load balancing whose time complexity is
$O(M + \log p)$ on the pipelined hypercube and $O(M \log p)$ on the
shuffle-exchange, cube-connected cycles and butterfly. This algorithm is
optimal on the pipelined hypercube. We also provide a lower bound for our
bounded-degree networks, and show that load balancing requires more time on
the shuffle-exchange, the cube-connected cycles, or the butterfly than on the
weak hypercube.

Sorting  is  a  fundamental  computational  problem  that  has  been  investigated 
for  several  decades.  This  problem  can  be  solved  in  0(nlogn)  time  sequen¬ 
tially.  More  recently,  efficient  parallel  sorting  algorithms  on  the  PRA.M  model 
[12,24]  and  on  the  network  model  [1,18,39,47,55,61,64,78,82]  have  been  devel¬ 
oped.  Cole  developed  an  optimal  Q(- j  algorithm  for  the  EREW  PRAM 
[12].  On  the  network  model,  Leighton  developed  an  0(  —  algorithm  for  his 
bounded-degree  network  based  on  the  AKS  sorting  network  [47];  Cypher  and 
Sanz  developed  an  )  algorithm  for  shuffle-exchange  when  n  = 

for  some  k  <  2  [18];  and  Varman  and  Doshi  developed  an  0{  q.  log^  p) 
algorithm  for  the  pipelined  hypercube  [82]. 

Many  of  our  algorithms  in  this  thesis  need  to  sort  integers  from  a  small 
range  efficiently.  This  can  be  done  in  linear  time  sequentially  when  the  range  is 
polynomial  in  the  number  of  integers.  For  parallel  algorithm,  Hagerup  developed 
an  Oi  algorithm  for  the  CRCVV  PRAM,  for  1  <  p  <  [24]. 

Clearly, this algorithm is not efficient.

On  the  network  model,  Aggarwal  and  Huang  developed  an  algo¬ 

rithm  for  the  cube-connected  cycles  [1],  and  Han  developed  an  0{^)  algorithm 
for  the  complete  network  [27],  whenever  n  =  ^(p''''').  We  present  an  0{~) 
algorithm  for  the  pipelined  hypercube,  whenever  n  =  n(p' ■*■').  Integer  sort¬ 
ing  requires  Q(  time  on  the  weak  hypercube  and  on  any  bounded-degree 

network. Thus, the algorithm of Aggarwal and Huang and our algorithm are
optimal (for $n = \Omega(p^{1+\epsilon})$).

The load balancing algorithm and the integer sorting algorithm are used to
find an efficient solution for the general packet routing problem. An instance
of the $(n, k_1, k_2)$ routing problem is a set of $n$ packets, each of which
is specified by a source and a destination, such that no processor appears as
a source (respectively destination) in more than $k_1$ (respectively $k_2$)
packets. The problem is to route
these  requests  simultaneously.  When  ki  =  k^  =  this  problem  reduces  to 
a  permutation  problem.  Gottlieb  and  Kruskal  showed  that  this  permutation 
problem  requires  n(  time  on  any  bounded  degree  network  [23].  Clearly, 

this  permutation  problem  can  be  solved  by  using  the  integer  sorting  algorithms. 
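
The condition that defines an $(n, k_1, k_2)$ routing instance can be checked
directly. The following sketch only validates an instance; it does not route
anything. The function name and the sample packets are hypothetical.

```python
from collections import Counter


def is_routing_instance(packets, k1, k2):
    """Check the (n, k1, k2) condition on a list of (source, destination) pairs."""
    sources = Counter(src for src, _ in packets)
    destinations = Counter(dst for _, dst in packets)
    return (all(count <= k1 for count in sources.values())
            and all(count <= k2 for count in destinations.values()))


# n = 4 packets on 4 processors: processor 0 is a source twice and
# processor 2 is a destination twice.
packets = [(0, 1), (0, 2), (1, 2), (3, 0)]
```

This set of packets is a valid $(4, 2, 2)$ instance, but it is not a valid
instance once either bound is lowered to one.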

Peleg  and  Upfal  developed  an  algorithm  for  this  routing  problem  whose  time 
complexity  Q{k\  +  k2  +  ” )  on  their  bounded-degree  network  based  on  ex¬ 

pander  graphs  [58]. 

We  develop  an  algorithm  whose  time  complexity  is  0{kx  +  k2  +  j)  on  the 
pipelined  hypercube,  and  0{{ki  +  ^-2)  logp-f  on  the  weak  hypercube  and 

our  bounded-degree  networks,  whenever  n  =  H(p*'^').  The  problem  requires 
H(  time  on  the  weak  hypercube  and  on  any  bounded -degree  networks. 

Thus the upper bounds are tight for these networks.
A parallel algorithm is almost uniformly optimal if its running time is
provably the best possible (up to a constant factor) for all
$p \le n/\log^{k} n$, for some fixed constant $k$. In Chapter 3, we present
almost uniformly optimal algorithms to solve several problems such as the all
nearest smaller values (ANSV) problem, triangulating a monotone polygon, and
line packing.

The ANSV problem is a fundamental problem since it can be used to solve
several important problems such as triangulating a monotone polygon,
reconstructing a binary tree, parenthesis matching [7], and line packing [9].
There is a simple linear time sequential algorithm using a stack for this
problem. As for parallel algorithms, Berkman, Schieber, and Vishkin developed
an optimal $O(\frac{n}{p} + \log\log n)$ algorithm for the CRCW PRAM [7].
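
The stack-based sequential method mentioned above can be sketched as follows.
This is a minimal illustration that assumes the usual formulation of the
problem (for each element, find the nearest strictly smaller element on each
side); the function names are not taken from the thesis.

```python
def nearest_smaller_left(a):
    """Index of the nearest strictly smaller element to the left, or None."""
    result, stack = [], []  # stack holds indices with increasing values
    for i, x in enumerate(a):
        # Discard candidates that are not smaller than the current value.
        while stack and a[stack[-1]] >= x:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(i)
    return result


def all_nearest_smaller_values(a):
    """Nearest smaller indices on the left and on the right of every element."""
    n = len(a)
    left = nearest_smaller_left(a)
    # The right-hand answers are the left-hand answers of the reversed list.
    mirrored = nearest_smaller_left(a[::-1])
    right = [None if j is None else n - 1 - j for j in reversed(mirrored)]
    return left, right


left, right = all_nearest_smaller_values([3, 1, 4, 1, 5])
```

Each index is pushed and popped at most once per pass, which is where the
linear running time comes from.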

We present an algorithm for the ANSV problem whose time complexity is
0(^-|-log‘’ p)  on  the  pipelined  hypercube  and  (9("*°gP-|-log'*  p)  on  all  the  remain¬ 
ing  networks.  This  network  algorithm  is  used  to  find  algorithms  for  triangulating 
a monotone polygon and line packing. We also prove that the problems require

n(  time  on  the  weak  hypercube  and  Q{  — time  on  our  bounded- 

degree  networks.  Thus,  these  algorithms  are  also  almost  uniformly  optimal  on 
our  bounded-degree  networks  (despite  being  only  almost  efficient). 

In Chapter 4, we present an algorithm to solve the list ranking problem on the
networks. This is also a fundamental problem and there are many known results
for it [2,14,15,16,17,26,27,42]. The problem has a simple linear time
sequential algorithm.

Wyllie developed the first parallel algorithm for this problem [87]. This
algorithm  uses  the  doubling  technique  and  can  be  implemented  in  0{  — "«")  time 
on the EREW PRAM. Cole and Vishkin developed $O(\frac{n}{p} + \log n)$ time
algorithms for the CRCW PRAM [16] and for the EREW PRAM [17]. Anderson and
Miller developed a simplified $O(\frac{n}{p} + \log n)$ time algorithm for the
EREW PRAM [2].
For  the  network  model,  Kruskal  et  al.  developed  an  0{j  +  p^)  algorithm  on  the 
complete  network  [42].  Han  improved  this  result  to  0(^  +  plogp)  [27]. 
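
The doubling technique can be illustrated by simulating its synchronous rounds
sequentially. The sketch below follows the textbook pointer-jumping scheme and
makes two assumptions for the example: the last node of the list points to
itself, and the rank of a node is its distance to the end of the list. It is
not the hypercube algorithm of this thesis.

```python
def list_rank_by_doubling(succ):
    """Rank every node by pointer jumping.

    succ[i] is the successor of node i; the last node points to itself.
    Returns the ranks and the number of synchronous rounds used.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # Every node reads the old arrays, then all nodes write new values.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return rank, rounds


# The list 3 -> 1 -> 4 -> 0 -> 2, stored as a successor array.
rank, rounds = list_rank_by_doubling([2, 4, 2, 1, 0])
```

Each round doubles the distance covered by every pointer, so a list of $n$
nodes needs about $\log n$ rounds, each of which touches all $n$ nodes.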

We present a list ranking algorithm that runs on the pipelined hypercube
in time $O(\frac{n}{p})$ when $n = \Omega(p^{1+\epsilon})$, and in time
$O(\frac{n \log n}{p} + \log^{3} p)$ otherwise. We use these techniques to
obtain fast algorithms for several basic graph problems such as tree
expression evaluation, connected and biconnected components, ear
decomposition, and st-numbering. These problems are also addressed for the
other  network  models.  We  also  prove  that  list  ranking  requires  "'”SP)  time 
on  the  weak  hypercube  and  any  bounded-degree  network.  Thus,  our  algorithm 
is  optimal. 

In Chapter 5, we present fast network algorithms for several one-layer routing
problems. Many of the optimization problems arising in VLSI routing are
NP-complete [41,46,67,76]. One notable exception is the class of one-layer
routing problems associated with a hierarchical layout strategy such as
Bristle-Blocks [36]. See [13,20,48,49,51,53,59,72,79] for more examples.
Efficient serial solutions have been developed for most of these problems. As
for parallel solutions, algorithms that run in $O(\frac{n}{p} + \log n)$ time
on the CREW PRAM, and in $O(\frac{n}{p} + \log\log n)$ time on the CRCW PRAM,
were developed for several one-layer routing problems [10].

Figure 1.1: The PRAM model

We present fast algorithms for the detailed routing and the routability testing
problems within a rectangle whose time complexities are $O(\frac{n}{p})$ on
the pipelined hypercube, and $O(\frac{n \log p}{p})$ on all the remaining
networks, when $n = \Omega(p^{1+\epsilon})$. Fast algorithms are also
developed for several subproblems that are interesting on their own. One such
subproblem is to determine the contours of the union of sets of contours
within a rectangle.

In  Chapter  6,  we  summarize  the  results  obtained  in  this  thesis  and  describe 
directions  for  future  research. 


1.2  The  PRAM  Model 

The PRAM (Parallel Random Access Machine) consists of $p$ synchronous
processors, $P_0, P_1, \ldots, P_{p-1}$, all having access to and
interchanging data through a large shared memory (Figure 1.1). In a single
cycle, each processor may read a datum from or write a datum into a shared
memory cell, or else perform in local memory one of a prescribed set of
operations (various tests, arithmetic operations, Boolean operations, etc.).
Each processor $P_i$, $0 \le i \le p-1$, is uniquely identified by an index
$i$ which can be referred to in the program.

There are several variations of the above general model based on the
assumptions regarding the handling of the simultaneous access of several
processors to a single location of the common memory. An EREW (Exclusive-Read
Exclusive-Write) PRAM does not allow simultaneous access by more than one
processor to the same memory location. A CREW (Concurrent-Read
Exclusive-Write) PRAM allows simultaneous access only for read instructions.
A CRCW (Concurrent-Read Concurrent-Write) PRAM allows simultaneous access for
both read and write instructions. On the Common CRCW PRAM, it is assumed that
if several processors attempt to write simultaneously at the same memory
location, then all of them are trying to write the same value. On the
Arbitrary CRCW PRAM, it is assumed that one of the processors attempting to
write simultaneously at the same memory location succeeds, but we do not know
in advance which one. On the Priority CRCW PRAM, it is assumed that the
processor with minimum index among the processors attempting to write
simultaneously into the same memory location succeeds. It turns out that these
machines do not differ substantially in their computing power, and that their
computing power increases in a strict fashion in the order they were
introduced.
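
The three CRCW write rules can be made concrete with a small simulation of one
memory cell. This is only an illustrative sketch: the function name, the mode
strings and the use of a random choice to model the Arbitrary rule are
assumptions made for the example.

```python
import random


def resolve_concurrent_write(requests, mode, rng=random):
    """Resolve simultaneous writes to a single shared memory cell.

    requests is a list of (processor_index, value) pairs and mode is one of
    "common", "arbitrary" or "priority". Returns the value that is stored.
    """
    if not requests:
        raise ValueError("no write requests")
    if mode == "common":
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW requires identical values")
        return values.pop()
    if mode == "arbitrary":
        # Any one writer may succeed; the winner is not known in advance.
        return rng.choice(requests)[1]
    if mode == "priority":
        # The processor with the minimum index succeeds.
        return min(requests, key=lambda request: request[0])[1]
    raise ValueError(f"unknown mode: {mode}")


requests = [(3, "a"), (1, "b"), (2, "c")]
stored = resolve_concurrent_write(requests, "priority")
```

A program that is correct under the Common rule is also correct under the
Arbitrary and Priority rules, which reflects the ordering of the models by
computing power.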

The PRAM model of parallel computation was first studied in the late 70's by
a number of researchers, and has become widely used since. This research
work provides some justification for the selection of this model as an abstract
model of parallel computation. The reader is referred to [21,38,51,83] for surveys
of results concerning the PRAM.

The efficiency of a parallel algorithm is measured by its running time and
the number of processors it uses. These two measures are strongly related.
The following theorem due to Brent [8] implies that we can always slow down
a parallel algorithm by reducing the number of its processors while keeping the
same processor-time product. This is the reason why we often measure the
efficiency of a parallel algorithm by its minimal running time and the number
of processors required to achieve this running time.

Theorem 1.1 A PRAM algorithm requiring $t$ parallel steps and a total of $x$
operations can be implemented by a $p$-processor PRAM within
$\frac{x}{p} + t$ parallel steps.

Proof: Let $x_i$ be the number of operations performed in step $i$,
$1 \le i \le t$. The $p$-processor PRAM can perform the $x_i$ operations in
$\lceil x_i/p \rceil$ steps. Hence the total number of steps on the
$p$-processor PRAM is

$$\sum_{i=1}^{t} \left\lceil \frac{x_i}{p} \right\rceil \le \sum_{i=1}^{t} \left( \frac{x_i}{p} + 1 \right) = \frac{x}{p} + t.$$
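
The counting argument in the proof can be checked numerically. The sketch
below assumes that step $i$ of the original algorithm performs $x_i$
independent operations and that $p$ processors need $\lceil x_i/p \rceil$
steps for them; the function name and the sample schedule are illustrative.

```python
def brent_steps(ops_per_step, p):
    """Steps a p-processor PRAM needs to simulate the given schedule.

    ops_per_step[i] is x_i, the number of operations in parallel step i.
    """
    # (x + p - 1) // p is the integer form of ceil(x / p).
    return sum((x + p - 1) // p for x in ops_per_step)


ops = [8, 4, 2, 1]   # t = 4 steps and x = 15 operations, e.g. a tree sum
p = 3
steps = brent_steps(ops, p)
bound = sum(ops) / p + len(ops)   # x/p + t
```

For this schedule the simulation takes 3 + 2 + 1 + 1 = 7 steps, which is
within the bound $x/p + t = 9$.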




1.2.1  NC  and  P-completeness 

The study of parallel complexity within the PRAM model has led to some
important negative results: there are some problems that are not likely to have
fast parallel algorithms. Let P be the set of decision problems solvable by
deterministic Turing machines in polynomial time, and let NC be the set of
decision problems solvable in polylog time, i.e., in time
$O(\log^{O(1)} n)$, where $n$ is the length of the input, using a polynomial
number of processors by deterministic algorithms [60]. Clearly,
$NC \subseteq P$. A fundamental open question is whether $P \subseteq NC$. If
it were so, it would mean, roughly speaking, that every problem that has a
good solution in a sequential model of computation can be solved very fast in
parallel, using a polynomial number of processors.

We adopt the usual convention of representing a decision problem as a subset
of $\{0,1\}^*$. Decision problem $\Pi_1$ is said to be logspace-reducible to
decision problem $\Pi_2$ if there is a function
$f : \{0,1\}^* \to \{0,1\}^*$ such that $f$ is computable by a PRAM in polylog
time using a polynomial number of processors and, for all $x \in \{0,1\}^*$,
$x \in \Pi_1$ if and only if $f(x) \in \Pi_2$. A decision problem in P is
called P-complete if every problem in P is logspace-reducible to it. If
$\Pi_1$ is logspace-reducible to $\Pi_2$ and $\Pi_2$ is in NC, then $\Pi_1$ is
in NC. This implies that if $\Pi$ is a P-complete problem, then $P = NC$ if
and only if $\Pi \in NC$. Thus the P-complete problems can be viewed as the
problems in P most resistant to parallelization.

The  usual  method  of  showing  that  a  problem  is  P-complete  is  to  show  that  it 
lies  in  P  and  that  some  standard  P-complete  problem  is  logspace-reducible  to  it. 
P-complete  problems  include  the  circuit  value  problem,  the  greedy  independent 
set  problem,  and  the  maxflow  problem.  See  [38]  for  more  details  and  a  list  of 
these  problems. 


1.3  The  Network  Model 

The PRAM model can play a useful role as a theoretical yardstick for measuring
the limits of parallel computation. Since the communication between its
processors can be done trivially through its shared memory, this model can
also be used to detect the intrinsic parallelism of a given problem. Moreover,
the techniques and paradigms provided by the PRAM algorithms can be used for
designing algorithms on a more realistic model. However, the PRAM model is
not easy to realize physically because of physical fan-in limitations. In a
physically realizable assemblage, we can only expect any computing element to
have a small number of external connections. We must therefore consider
parallel assemblages in which a large number of communicating processors, each
with its own memory, are connected together, but where each processor
communicates with a small number of other processors.


Figure  1.2:  The  network  model 


A $p$-processor fixed interconnection network may be viewed as an undirected
graph, where vertices correspond to processors and edges correspond to
communication links. Each processor $P_i$, $0 \le i \le p-1$, has a large
local memory. There is no shared memory. We assume that the processors operate
synchronously and that they communicate with one another by sending and
receiving data packets over the communication links provided by the network
(Figure 1.2). Each processor can set up a single packet of bounded length in a
unit of time.

The distance between two processors can be defined in the standard
graph-theoretic way, i.e., the distance from $P_i$ to $P_k$ is the minimum $d$
for which there exists a sequence
$P_i = P_{i_0}, P_{i_1}, \ldots, P_{i_d} = P_k$, where $P_{i_j}$ is directly
connected to (or a neighbor of) $P_{i_{j+1}}$, $0 \le j < d$. The diameter of
the network is the maximum distance between any two processors of the network.
The degree of a processor is the number of its neighbors. The degree of the
network is the maximum degree of any processor of the network. Any degree $k$
network has diameter at least $\log_k p - 1$ [23]. A network is of bounded
degree if its degree is bounded by a constant. Hence, any bounded-degree
network with $p$ processors must have diameter $\Omega(\log p)$.
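
These definitions translate directly into a breadth-first search over the
network graph. The sketch below assumes that the network is connected and is
given as an adjacency dictionary; the function names and the ring example are
illustrative and do not appear in the thesis.

```python
from collections import deque
from math import log


def distances_from(adj, source):
    """Breadth-first search distances from source in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Maximum distance between any two processors of a connected network."""
    return max(max(distances_from(adj, s).values()) for s in adj)


def degree(adj):
    """Maximum number of neighbors of any processor."""
    return max(len(neighbors) for neighbors in adj.values())


# Hypothetical example: a ring of p = 8 processors.
p = 8
ring = {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}
```

The ring has degree 2 and diameter 4, which respects the lower bound
$\log_2 8 - 1 = 2$ but is far from matching it.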

Another important property of the network is related to the graph-theoretic
notion of separators. Formally, we say that a graph $G$ has an
$f(n)$-separator ($f(n)$-edge separator), or is $f(n)$-separable ($f(n)$-edge
separable), if either it has only one vertex, or the following two statements
are true.

(1) Let $n_0$ be the number of vertices of $G$. Then there exist constants
$\alpha < 1$ and $\beta > 0$ such that the removal of a set of at most
$\beta f(n_0)$ vertices (edges) disconnects $G$ into two graphs $G_1$ and
$G_2$, of $n_1$ and $n_2$ vertices each, such that $n_1 \le \alpha n_0$ and
$n_2 \le (1 - \alpha) n_0$.

(2) Both $G_1$ and $G_2$ are $f(n)$-separable ($f(n)$-edge separable).

Note that $f(n)$-edge separable graphs are $f(n)$-separable, but the converse
is not true in general. We say a family of graphs is $f(n)$-separable
($f(n)$-edge separable) if every member of the family is $f(n)$-separable
($f(n)$-edge separable). Separators which achieve exact partitioning, i.e.,
$\alpha = \frac{1}{2}$, are called strong separators. Strong edge separators
are also known as bisectors. For example, planar graphs are $\sqrt{n}$-edge
separable for $\alpha = \frac{2}{3}$ and $\beta = 2\sqrt{2}$, and trees are
1-separable for $\alpha = \frac{2}{3}$ and $\beta = 1$ [81].

The separability property of a graph is important since it can provide some
information on the layout area of the graph and a lower bound for routing on
the graph. Graphs with small separators tend to have small layout areas since
only a small number of edges connect their two separated subgraphs. For the
same reason, they require much time for routing: it is difficult to send data
from one separated subgraph to the other because of the lack of communication
links between them.

We now introduce several important network topologies: the complete network,
the d-dimensional mesh, the binary tree network, the hypercube, the butterfly,
the cube-connected cycles and the shuffle-exchange network, and compare
them with respect to the properties mentioned above.

1.3.1  The  Complete  Network 

The most general fixed interconnection scheme is the complete network in which
every processor is directly connected to every other processor (Figure 1.3).
Since its diameter is only one, it can perform any permutation in one cycle.
However, it is physically unrealistic for several reasons. An arbitrarily
large number of communication links cannot enter a processor because of
physical fan-in limitations, so only very small machines would be
constructible. Moreover, since its degree is $p - 1$ and the number of its
communication links is $\frac{p(p-1)}{2}$, the space it would occupy and the
length of the longest communication link increase very rapidly as $p$
increases. The complete network is interesting as a theoretical model since
algorithmic lower bounds for this model are automatically lower bounds for all
fixed interconnection networks.

1.3.2  The  d-Dimensional  Mesh 

In a $d$-dimensional
$\sqrt[d]{p} \times \sqrt[d]{p} \times \cdots \times \sqrt[d]{p}$ mesh, the
$p$ processors may be thought of as logically arranged in a $d$-dimensional
$\sqrt[d]{p} \times \sqrt[d]{p} \times \cdots \times \sqrt[d]{p}$ array. The
processor at location $(i_{d-1}, i_{d-2}, \ldots, i_0)$ of the array is
connected to the processors at locations
$(i_{d-1}, \ldots, i_j \pm 1, \ldots, i_0)$, $0 \le j \le d-1$. This network
has degree $2d$, diameter $d(\sqrt[d]{p} - 1)$ and a $p^{1 - 1/d}$-bisector.
It can perform permutations in $O(d \sqrt[d]{p})$ cycles [56,78].
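
The connection rule of the mesh can be written out as a short function. The
sketch assumes a mesh without end-around connections, with `side` processors
along each of the $d$ dimensions, and represents a location as a tuple; these
names are choices made for the example.

```python
def mesh_neighbors(location, side):
    """Neighbors of a processor in a d-dimensional mesh.

    location is a tuple of d coordinates, each in range(side). Boundary
    processors have fewer neighbors because there is no wrap-around.
    """
    neighbors = []
    for j, coordinate in enumerate(location):
        for step in (-1, 1):
            moved = coordinate + step
            if 0 <= moved < side:
                neighbors.append(location[:j] + (moved,) + location[j + 1:])
    return neighbors


corner = mesh_neighbors((0, 0), 4)        # corner of a 4 x 4 mesh
interior = mesh_neighbors((1, 1, 1), 3)   # center of a 3 x 3 x 3 mesh
```

An interior processor has exactly $2d$ neighbors, which is the degree of the
network.
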
Figure 1.3: Complete network of size eight

One of the most natural interconnection schemes is the 2-dimensional
$\sqrt{p} \times \sqrt{p}$ mesh (Figure 1.4). Its physical layout is
straightforward in the 2-dimensional space. It has degree four, diameter
$2(\sqrt{p} - 1)$ and a $\sqrt{p}$-bisector. It performs permutations in
$\Theta(\sqrt{p})$ cycles. The diameter can be halved by including end-around
connections as in the ILLIAC IV [5].

1.3.3  The  Binary  Tree  Network 

In a binary tree network, the $p = 2^d - 1$ processors are connected into a
complete binary tree with depth $d - 1$ (Figure 1.5). Each non-root internal
processor $P_i$, $2 \le i \le 2^{d-1} - 1$, is connected to three processors,
$P_{L(i)}$, $P_{R(i)}$ and $P_{F(i)}$, where $L(i) = 2i$, $R(i) = 2i + 1$ and
$F(i) = \lfloor i/2 \rfloor$. The root processor $P_1$ is connected to $P_2$
and $P_3$ as its left and right child, respectively. The leaf processors
$P_i$, $2^{d-1} \le i \le 2^d - 1$, are connected only to their fathers
$P_{F(i)}$.

This network has a 1-bisector and diameter $2 \log_2 \frac{p+1}{2}$ - the
distance from a leaf up to the root and back down to another leaf. It also has
a simple layout as shown in the above figure. Unfortunately, tree networks
require linear time to perform permutations. For example, assume we wish to
move each item from the root's left subtree to the right subtree, and vice
versa. The root is then a bottleneck since it is the only bridge between the
two subtrees.
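
The index arithmetic of the tree network can be sketched as follows, assuming
the usual numbering in which the root is $P_1$ and the children of $P_i$ are
$P_{2i}$ and $P_{2i+1}$. The function names are illustrative.

```python
def left(i):
    """Index of the left child of processor P_i."""
    return 2 * i


def right(i):
    """Index of the right child of processor P_i."""
    return 2 * i + 1


def father(i):
    """Index of the father of processor P_i (for i >= 2)."""
    return i // 2


def tree_distance(i, k):
    """Number of links on the unique path between P_i and P_k."""
    hops = 0
    while i != k:
        # The larger index is at least as deep, so move it one level up.
        if i > k:
            i = father(i)
        else:
            k = father(k)
        hops += 1
    return hops


d = 3                                   # p = 2**d - 1 = 7 processors
leaf_to_leaf = tree_distance(4, 7)      # leaves in different subtrees
```

For $d = 3$ the path between the leaves $P_4$ and $P_7$ passes through the
root and has length $2(d - 1) = 4$, the diameter of the network.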

1.3.4  The  Hypercube  Network 

In a hypercube network, the $p = 2^d$ processors are connected into a
$d$-dimensional Boolean cube. Let the binary representation of $i$ be
$i_{d-1} i_{d-2} \ldots i_0$, $0 \le i \le p-1$. Then processor $P_i$ is
connected to processors $P_{i^{(j)}}$, where
$i^{(j)} = i_{d-1} \ldots \bar{i}_j \ldots i_0$ and $\bar{i}_j = 1 - i_j$,
$0 \le j \le d-1$. The hypercube has a recursive structure: a $d$-dimensional
cube can be extended to a $(d+1)$-dimensional cube by connecting corresponding
processors of two $d$-dimensional cubes. One has the highest-order address bit
0 and the other has the highest-order address bit 1 (Figure 1.6(a)).

Figure 1.6: 4-dimensional hypercube (a) and switches in each processor (b)
This network has diameter $d = \log p$ (for example, the distance between
$P_0$ and $P_{p-1}$) and a strong $\frac{p}{\sqrt{\log p}}$-separator. Since
its degree is $\log p$ and the total number of its communication links is
$d \cdot 2^{d-1}$, its layout area, which is $O(p^2)$, would grow more rapidly
than that of similar networks such as the shuffle-exchange and the
cube-connected cycles. It can perform an arbitrary permutation in
$O(\log p)$ cycles [35,86].

The  hypercube  architecture  has  many  interesting  topological  and  graph- 
theoretic  properties  that  make  it  a  very  good  candidate  for  parallel  processing. 
Actually  several  hypercube  networks  have  been  available  commercially  for  some 
time. 
The hypercube network that will be used in the rest of this thesis consists of
$p = 2^d$ synchronous processors. Two different hypercube models, the pipelined
hypercube model and the weak hypercube model, will be used.

In the pipelined hypercube [82], there are $d$ switches, $(0,i), \ldots, (d-1,i)$, in each processor $P_i$, $0 \le i \le p-1$. The switches are connected by a shared bus to the processing unit of the processor. Each switch $(j,i)$, $0 \le j \le d-1$, is connected by a bidirectional intra-processor link to switch $((j+1) \bmod d, i)$, and by a bidirectional inter-processor link to switch $(j, i^j)$ (Figure 1.6(b)). A cycle of a switch consists of an odd phase and an even phase. The odd phase consists of data transfer between switches in the same processor along the intra-processor link. In the even phase, data is transferred between different processors using the inter-processor link.

The  switches  form  a  synchronous,  pipelined  packet-switched  network  that 
is  used  to  transfer  blocks  of  data  between  the  processors.  A  packet  consists 
of a constant number of data elements. Three types of communication traffic, forward routing, reverse routing and cube routing, that arise in all the algorithms in this thesis must be supported by the network.

In forward routing, communication during the odd phase is from switch $(j,i)$ to $((j+1) \bmod d, i)$, while for reverse routing it is from $(j,i)$ to $((j-1) \bmod d, i)$. On receiving a packet, switch $(j,i)$ decodes the destination address associated with the packet, and buffers it for transmission on either the intra-processor link or the inter-processor link to $P_{i^j}$ as appropriate. If the packet is buffered on the intra-processor link, the packet will be transferred to the switch $((j+1) \bmod d, i)$ or $((j-1) \bmod d, i)$ in the odd phase of the next cycle. Otherwise, it will be transferred to $(j, i^j)$ in the even phase of the current cycle. Cube routing is employed to emulate the point-to-point connections of the hypercube. We require at most one switch of a processor to send or receive a packet to or from the processing unit of the processor in the same cycle. Thus a shared bus between the processing unit and the switches in each processor represents an adequate connection.

In  the  weak  hypercube  [61],  each  processor  is  allowed  to  send  or  receive  at 
most  one  packet  and  perform  a  constant  number  of  local  computations  in  a 
single  time  step.  We  assume  that  the  instruction  format  does  not  restrict  all 
packets  to  cross  the  same  dimension  in  a  given  time  step.  Clearly,  this  model  is 
weaker  in  communication  than  the  pipelined  hypercube  model. 

1.3.5  The  Butterfly  Network 

The butterfly network is an interconnection system most frequently associated with the Fast Fourier Transform. In general, it consists of $p = (q+1)2^q$ processors, organized as $q+1$ ranks of $2^q$ processors each (Figure 1.7). Optionally, we shall identify the rightmost and the leftmost ranks, so there is no rank $q$, and the processors on ranks 0 and $q-1$ are connected directly.

Figure 1.7: Butterfly network for $q = 3$ and $p = 4 \cdot 2^3$

Let us denote the processor $i$ on the rank $r$ by $P_{i,r}$, $0 \le i < 2^q$, $0 \le r \le q$. Then processor $P_{i,r+1}$ is connected to the two processors $P_{i,r}$ and $P_{i^r,r}$, and processor $P_{i^r,r+1}$ is connected to the two processors $P_{i,r}$ and $P_{i^r,r}$. Recall that $i^r = i_{q-1} \ldots \bar{i}_r \ldots i_0$. These four connections form a "butterfly" pattern, from which the name of the network is derived.

The hypercube is actually the butterfly with the rows collapsed. The communication link in the hypercube between processors $P_i$ and $P_{i^r}$ is identified with the communication links in the butterfly between $P_{i,r+1}$ and $P_{i^r,r}$, and between $P_{i^r,r+1}$ and $P_{i,r}$.

1.3.6  The  Cube-Connected  Cycles 

The cube-connected cycles is a network of $p = 2^d$ identical processors, where $d = l + 2^l$. When $d$ is arbitrary, $l$ is the smallest integer for which $l + 2^l \ge d$, and the resulting modifications are straightforward. Each processor has a $d$-bit address $m$, which in turn is expressed as a pair $(i,r)$ of integers represented with $(d-l)$ and $l$ bits, respectively, such that $i \cdot 2^l + r = m$ [62]. This network consists of $2^{d-l}$ cycles of length $2^l$ and those cycles are connected as a $2^l$-dimensional Boolean cube. Let $F(i,r) = (i,(r+1) \bmod 2^l)$, $B(i,r) = (i,(r-1) \bmod 2^l)$ and $L(i,r) = (i^r, r)$. Then each processor $P_{(i,r)}$ of the cube-connected cycles has three neighbors $P_{F(i,r)}$, $P_{B(i,r)}$ and $P_{L(i,r)}$ (Figure 1.8). Processor $P_{(i,r)}$ is connected to processors $P_{F(i,r)}$ and $P_{B(i,r)}$ around the cycle, and to processor $P_{L(i,r)}$ across the cube.
Figure 1.8: Cube-connected cycles for $p = 3 \cdot 2^3$

The $p$-processor cube-connected cycles can be considered as the $p$-processor butterfly with processors in rank 0 identified with processors in rank $2^l$. The cycles of the cube-connected cycles are exactly the cycles of the butterfly. The only detail necessary to relate the butterfly to the cube-connected cycles is that in the cube-connected cycles, processor $P_{(i,r)}$ is connected to $P_{(i^r,r)}$, while in the butterfly processor $P_{i,r}$ is connected to $P_{i^r,r+1}$. However, by following a pair of links in the cube-connected cycles, we can get to $P_{(i^r,r+1)}$ from $P_{(i,r)}$: we go across the cube to $P_{(i^r,r)}$ and then around the cycle to $P_{(i^r,r+1)}$.

The cube-connected cycles has degree three, diameter $O(\log p)$ and a $\frac{p}{\log p}$-bisector, and its layout has area $\Theta(p^2/\log^2 p)$. It can perform an arbitrary permutation in $O(\log p)$ cycles [62].


1.3.7  The  Shuffle-Exchange  Network 


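The shuffle-exchange network is based on the perfect shuffle and the exchange interconnections [75]. Define $PS(i)$ and $EX(i)$, $0 \le i < p = 2^d$, as follows:

$$PS(i) = \begin{cases} 2i & \text{if } i < p/2, \\ 2i - p + 1 & \text{otherwise,} \end{cases} \qquad EX(i) = \begin{cases} i + 1 & \text{if } i \text{ is even,} \\ i - 1 & \text{otherwise.} \end{cases}$$

Then $PS^{-1}$ can be described as

$$PS^{-1}(i) = \begin{cases} i/2 & \text{if } i \text{ is even,} \\ (i + p - 1)/2 & \text{otherwise.} \end{cases}$$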

Figure  1.9:  Shuffle-exchange  network  of  size  eight 


If $i_{d-1} i_{d-2} \ldots i_0$ denotes the binary representation of $i$, then $PS(i)$ and $PS^{-1}(i)$ correspond to the left rotation and right rotation of $i$ by one position, respectively, as follows:

$$PS(i_{d-1} i_{d-2} \ldots i_0) = i_{d-2} \ldots i_0 i_{d-1}$$

and

$$PS^{-1}(i_{d-1} i_{d-2} \ldots i_0) = i_0 i_{d-1} \ldots i_1.$$


In the shuffle-exchange network, each $P_i$ has three neighbors, $P_{EX(i)}$, $P_{PS(i)}$ and $P_{PS^{-1}(i)}$ (Figure 1.9). It has diameter approximately $2 \log_2 p$ and a $\frac{p}{\log p}$-bisector, and its layout has area $\Theta(p^2/\log^2 p)$. It can perform an arbitrary permutation in $O(\log p)$ cycles [71].
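The arithmetic definitions of the perfect shuffle and its inverse can be checked against the bit-rotation view. The sketch below is illustrative only; the function names are assumptions, and `unshuffle` uses the closed form implied by the right rotation (the low bit moves to the top position).

```python
def perfect_shuffle(i, d):
    """PS(i) for p = 2**d: 2i if i < p/2, else 2i - p + 1."""
    p = 1 << d
    return 2 * i if i < p // 2 else 2 * i - p + 1


def unshuffle(i, d):
    """PS^{-1}(i): right rotation of the d-bit address by one position."""
    p = 1 << d
    return i // 2 if i % 2 == 0 else (i - 1) // 2 + p // 2


def exchange(i):
    """EX(i): complement the least significant bit."""
    return i + 1 if i % 2 == 0 else i - 1


def rotate_left(i, d):
    """Left rotation of the d-bit representation of i by one position."""
    p = 1 << d
    return ((i << 1) | (i >> (d - 1))) & (p - 1)
```

With d = 3, address 6 (binary 110) is shuffled to 5 (binary 101), which is exactly the left rotation of its bits.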

Table 1.1 shows the asymptotic formulas for the various quantities associated with the fixed interconnection networks introduced.


1.4  Hypercube  Algorithms 

In  this  section,  we  introduce  several  fundamental  hypercube  algorithms  that  will 
be  used  in  the  rest  of  the  thesis. 

1.4.1  Normal  Algorithms 

There are three classes of algorithms for the weak hypercube: leveled algorithms, which use communication links in only one dimension at a time, but in arbitrary order; normal algorithms, which are leveled algorithms subject to the additional condition that consecutive dimensions are used at consecutive time steps; and fully normal algorithms, which are normal algorithms subject to the additional condition that all $d$ dimensions of the hypercube are used in sequence.

network              degree     links                  diameter      separator                    cycles (permutations)   layout area
complete network     $p-1$      $\frac{p(p-1)}{2}$     1             $p^2$                        1                       $O(p^4)$
2-D mesh             4          $2p$                   $2\sqrt{p}$   $\sqrt{p}$                   $O(p)$                  $O(p)$
binary tree          3          $p$                    $2\log p$     1                            $O(p)$                  $O(p)$
hypercube            $\log p$   $\frac{p \log p}{2}$   $\log p$      $\frac{p}{\sqrt{\log p}}$    $O(\log p)$             $O(p^2)$
cube-conn. cycles    3          $1.5p$                 $2\log p$     $\frac{p}{\log p}$           $O(\log p)$             $O(p^2/\log^2 p)$
shuffle-exchange     3          $1.5p$                 $2\log p$     $\frac{p}{\log p}$           $O(\log p)$             $O(p^2/\log^2 p)$

Table 1.1: Comparisons among the fixed interconnection networks

Theorem  1.2  [62,70,81]  Normal  algorithms  can  be  simulated  on  the  shuffle- 
exchange,  the  cube-connected  cycles,  or  the  butterfly  with  only  a  constant  slow¬ 
down.  □ 

Example 1.1: Let $p = 2^d$ elements $\{a_0, a_1, \ldots, a_{p-1}\}$ and a binary associative operator $*$ be given. Suppose we have a $p$-processor weak hypercube such that $a_i$ is stored in processor $P_i$, $0 \le i \le p-1$. Then prefix sums computation consists of evaluating the $p$ partial sums $s_j = a_0 * a_1 * \cdots * a_j$, $0 \le j \le p-1$. There is a fully normal algorithm to solve the problem.

    $s_i \leftarrow a_i$; $t_i \leftarrow a_i$;
    for $j \leftarrow 0$ to $d-1$ do in parallel
        if $i > i \oplus 2^j$    {$\oplus$: Exclusive or}
        then $s_i \leftarrow t_{i \oplus 2^j} * s_i$; $t_i \leftarrow t_{i \oplus 2^j} * t_i$
        else $t_i \leftarrow t_i * t_{i \oplus 2^j}$ □
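The prefix sums algorithm of Example 1.1 can be simulated sequentially to see how the running prefix $s_i$ and the subcube total $t_i$ evolve. This is a minimal sketch under the assumption that one loop iteration stands for one synchronous parallel step; the function name `hypercube_prefix` is illustrative.

```python
from operator import add


def hypercube_prefix(a, op=add):
    """Sequential simulation of the fully normal prefix computation.

    a  : list of p = 2**d values, a[i] held by processor P_i
    op : associative (not necessarily commutative) binary operator
    Returns the list s with s[i] = a[0] * a[1] * ... * a[i].
    """
    p = len(a)
    d = p.bit_length() - 1
    assert p == 1 << d, "the number of processors must be a power of two"
    s = list(a)  # s[i]: prefix within the current subcube
    t = list(a)  # t[i]: total of the current subcube
    for j in range(d):  # dimensions 0, 1, ..., d-1 in sequence
        new_s, new_t = list(s), list(t)
        for i in range(p):  # all processors act simultaneously
            partner = i ^ (1 << j)
            if i > partner:
                # the partner's subcube holds the lower-indexed elements
                new_s[i] = op(t[partner], s[i])
                new_t[i] = op(t[partner], t[i])
            else:
                new_t[i] = op(t[i], t[partner])
        s, t = new_s, new_t
    return s
```

Because operands are always combined in index order, the simulation also works for non-commutative operators such as string concatenation.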

Example 1.2: Assume that there is an array $A = (a_0, a_1, \ldots, a_{p-1})$ such that $a_0 \le \ldots \le a_{p/2-1}$ and $a_{p/2} \ge \ldots \ge a_{p-1}$. Suppose that $a_i$ is stored in processor $P_i$, $0 \le i \le p-1$. Then there is a fully normal algorithm to merge the array $A$. The algorithm is referred to as the bitonic merge algorithm [6,39].

    for $i \leftarrow d-1$ to 0 do in parallel
        if ($j < j \oplus 2^i$ and $a_j > a_{j \oplus 2^i}$) or ($j > j \oplus 2^i$ and $a_j < a_{j \oplus 2^i}$)
        then $a_j \leftarrow a_{j \oplus 2^i}$

Thus, an arbitrary array $A = (a_0, a_1, \ldots, a_{p-1})$, with $a_i$ stored in processor $P_i$, can be sorted in $O(\log^2 p)$ steps on the weak hypercube by applying the merge algorithm to each subcube of size $2^j$, increasing $j$ from 1 to $d$ (bitonic sort). Note that the sorting algorithm is normal. □
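The bitonic merge of Example 1.2 and the resulting sort can be simulated sequentially as follows. This is a sketch, not the thesis's code: the function names are illustrative, and the sort assumes the standard convention (not spelled out in the text) that neighboring subcubes are merged in alternating directions so that each larger subcube receives a bitonic input.

```python
def bitonic_merge(a):
    """Merge a bitonic array (ascending half, then descending half).

    Dimension i is used at step i, from d-1 down to 0; each pair of
    processors (j, j xor 2**i) keeps the smaller value at the lower index.
    """
    p = len(a)
    d = p.bit_length() - 1
    assert p == 1 << d
    a = list(a)
    for i in range(d - 1, -1, -1):
        for j in range(p):
            partner = j ^ (1 << i)
            if j < partner and a[j] > a[partner]:
                a[j], a[partner] = a[partner], a[j]
    return a


def bitonic_sort(a):
    """Sort by merging subcubes of size 2**k for k = 1, ..., d.

    Assumption: a subcube is merged in ascending order when bit k of its
    addresses is 0 and in descending order otherwise.
    """
    p = len(a)
    d = p.bit_length() - 1
    assert p == 1 << d
    a = list(a)
    for k in range(1, d + 1):
        for i in range(k - 1, -1, -1):
            for j in range(p):
                partner = j ^ (1 << i)
                if j < partner:
                    ascending = ((j >> k) & 1) == 0
                    out_of_order = a[j] > a[partner] if ascending else a[j] < a[partner]
                    if out_of_order:
                        a[j], a[partner] = a[partner], a[j]
    return a
```

The sort uses $1 + 2 + \cdots + d = d(d+1)/2$ compare-exchange steps, which matches the $O(\log^2 p)$ bound.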

Since  many  powerful  techniques  have  been  developed  for  designing  efficient 
parallel  algorithms  on  the  PRAM  model,  it  is  important  to  develop  an  effi¬ 
cient  step  by  step  simulation  of  a  PRAM  algorithm  on  the  fixed  interconnection 
networks. Using the above sorting algorithm, we can show that any PRAM algorithm can be simulated with $O(\log^2 p)$ delay on the weak hypercube under certain conditions stated in the next theorem.

Theorem 1.3 Let a PRAM algorithm require $t$ steps and $m$ memory locations on any $p$-processor CRCW PRAM. Then this algorithm can be implemented to run in time $O(t \cdot \log^2 p)$ on the $p$-processor weak hypercube whenever $m = O(p)$.

Proof:  There  are  two  CRCW  operations  that  need  to  be  simulated  on  the 
hypercube:  the  concurrent  read  and  the  concurrent  write.  The  concurrent  read 
operation  can  be  simulated  as  follows. 

(1)  Sort  the  read  requests  according  to  their  destination  addresses. 

(2)  Choose  only  one  read  request  for  each  destination  address. 

(3)  Distribute  the  picked  requests  to  their  destinations. 

(4)  Read  the  data. 

(5)  Return  the  data  to  the  positions  of  the  requesting  packets  in  step  (2). 

(6)  Broadcast  the  data  to  the  requests  with  the  same  destination  addresses. 

(7)  Return  the  data  items  to  their  original  processors. 

All the above steps can be performed on the hypercube in $O(\log^2 p)$ steps by the sorting algorithm and the prefix sums algorithm. The concurrent write
operation  can  be  simulated  in  a  similar  way.  Among  the  write  requests  with  the 
same  destination,  only  one  request  is  chosen  according  to  the  assumption  of  the 
concurrent  writing.  □ 
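The seven-step concurrent read simulation in the proof of Theorem 1.3 can be illustrated with a sequential sketch. It models only the data flow (sort, elect one representative per address, read, propagate along each run, return); it does not model the hypercube communication itself. The function name and the list-based memory are assumptions made for illustration.

```python
def simulate_concurrent_read(requests, memory):
    """requests[i] is the memory address that processor i wants to read.

    Returns a list whose i-th entry is the value delivered to processor i.
    """
    # (1) sort the requests by destination address, remembering the origin
    order = sorted(range(len(requests)), key=lambda i: (requests[i], i))
    values = [None] * len(order)
    for pos, proc in enumerate(order):
        is_leader = pos == 0 or requests[order[pos - 1]] != requests[proc]
        if is_leader:
            # (2)-(5) one representative per address performs the read
            values[pos] = memory[requests[proc]]
        else:
            # (6) the value is broadcast along the run of equal addresses
            values[pos] = values[pos - 1]
    # (7) every value is returned to the processor that issued the request
    result = [None] * len(requests)
    for pos, proc in enumerate(order):
        result[proc] = values[pos]
    return result
```

Only one access per distinct address reaches the memory, so no memory location is read by two processors at once.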

Since  normal  algorithms  can  be  simulated  on  the  shuffle-exchange,  the  cube- 
connected  cycles,  or  the  butterfly  with  only  a  constant  slowdown,  the  following 
corollary  follows. 

Corollary 1.1 Let a PRAM algorithm require $t$ steps and $m$ memory locations on any $p$-processor CRCW PRAM. Then this algorithm can be implemented to run in time $O(t \cdot \log^2 p)$ on the $p$-processor shuffle-exchange, cube-connected cycles or butterfly whenever $m = O(p)$. □
Figure 1.10: The F graph (LSB) and the R graph (MSB) for p = 2^

1.4.2  Butterfly  Communication  Graphs 

A  butterfly  communication  graph  is  a  directed  graph  whose  vertices  represent 
switches  and  whose  edges  represent  unidirectional  communication  links  between 
the  switches.  Vertices  with  no  incoming  (outgoing)  edges  will  be  called  sources 
(sinks). We define two butterfly communication graphs, F and R, on which the required traffic patterns are proved to be conflict-free. We then show that a conflict-free set of routes in either F or R corresponds to conflict-free routing on the pipelined hypercube.

Both F and R have $p(d+1)$ vertices arranged in $d+1$ levels, with $p = 2^d$ vertices at each level. A vertex is denoted by $(l,i)$, where $l$ is the level number, $0 \le l \le d$, and $i$ is the index of the vertex within the level, $0 \le i \le p-1$. In F, a vertex $(l,i)$ at level $l$, $0 \le l \le d-1$, is connected to the two vertices $(l+1,i)$ and $(l+1,i^l)$ by edges directed from the former into the latter. In R, the vertex $(l,i)$ is connected to two vertices $(l+1,i)$ and $(l+1,i^{d-l-1})$. F will be referred to as the forward network and R will be referred to as the reverse network (Figure 1.10).

A switch at level $l$, $0 \le l < d$, examines a bit of the address associated with a packet, and passes it at the next cycle to a switch at level $l+1$. We describe two routing operations that the switches support, namely least significant bit (LSB) routing and most significant bit (MSB) routing. The switches in F employ LSB routing while those in R employ MSB routing.

In LSB routing, vertex $(l,i)$ of F, $0 \le l \le d-1$, routes a packet to either vertex $(l+1,i)$ or vertex $(l+1,i^l)$ depending on whether the $l$-th bit of the address field $A_F$ of the packet matches the $l$-th bit of $i$ or not, respectively. In MSB routing, vertex $(l,i)$ of R, $0 \le l \le d-1$, routes a packet to either vertex $(l+1,i)$ or to vertex $(l+1,i^{d-l-1})$ depending on whether the $(d-l-1)$-th bit of the address field $A_R$ of the packet matches the $(d-l-1)$-th bit of $i$ or not, respectively.

The switches in R also support a variant of MSB routing referred to as MSB routing with copy. This is used to implement a broadcast facility, in which a data packet can be sent simultaneously from a vertex $(0,i)$ to all the consecutively indexed destination vertices, $(d,a_i), (d,a_i+1), \ldots, (d,b_i-1), (d,b_i)$. The address field $A_R$ now consists of the pair of integers $(a_i, b_i)$, $a_i \le b_i$, which define the limits within which the packet must be sent. Each switch vertex $(l,j)$, $0 \le l \le d-1$, performs the following actions on receiving a packet of this form. If the $(d-l-1)$-th bits of $a_i$ and $b_i$ are the same, the vertex implements the usual MSB routing to route the packet to the vertex indicated by the address $a_i$. If the two bits are different, then the packet is forwarded to both the vertices $(l+1,j)$ and $(l+1,j^{d-l-1})$. However, the addresses $a_i$ and $b_i$ that are forwarded to the two vertices are updated as follows. The copy forwarded to the vertex with the smaller index will have $b_i$ set to $2^d - 1$, and that forwarded to the vertex with the larger index will have $a_i$ set to zero.
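MSB routing with copy can be traced level by level to confirm that a packet carrying the pair $(a, b)$ reaches exactly the level-$d$ vertices with indices $a, a+1, \ldots, b$. The sketch below is a sequential simulation written for illustration; the function name and the tuple representation of packets are assumptions.

```python
def msb_broadcast(d, source, a, b):
    """Level-d indices reached in R from vertex (0, source) with range [a, b].

    Each packet is a triple (index, lo, hi): its current vertex index and
    the address pair it carries.
    """
    full = (1 << d) - 1
    frontier = [(source, a, b)]
    for level in range(d):
        k = d - level - 1  # bit examined at this level
        mask = 1 << k
        nxt = []
        for j, lo, hi in frontier:
            bit_lo, bit_hi = (lo >> k) & 1, (hi >> k) & 1
            if bit_lo == bit_hi:
                # usual MSB routing: bit k of the index becomes bit k of lo
                nxt.append(((j & ~mask) | (bit_lo << k), lo, hi))
            else:
                # copy to both successors with updated limits
                nxt.append((j & ~mask, lo, full))  # smaller index: hi := 2**d - 1
                nxt.append((j | mask, 0, hi))      # larger index: lo := 0
        frontier = nxt
    return sorted(j for j, _, _ in frontier)
```

For example, with d = 3 a packet injected at vertex (0, 5) with limits (2, 6) is split at the first level and is delivered to the level-3 vertices 2 through 6.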

The route in F (R) from vertex $(0,i)$ to vertex $(d,j)$ is the ordered sequence of vertices in F (R), $((0,i), (1,j_1), \ldots, (d,j_d = j))$, that a packet with address $A_F = j$ ($A_R = j$) passes through. The sequence of edges between vertices in the route is the path of the route. A route in F is referred to as a forward route, while a route in R is referred to as a reverse route. Two routes are said to be conflict-free if they are vertex disjoint. A set of routes is conflict-free if they are pairwise vertex disjoint.

We now relate F and R to the pipelined hypercube, and show how conflict-free routes in F or R imply link-disjoint routes in the pipelined hypercube.

In the following, let $H_F$ ($H_R$) refer to the graph obtained from F (R) by replacing the directed edge $\langle (l,u),(l+1,u^l) \rangle$ ($\langle (l,u),(l+1,u^{d-l-1}) \rangle$) with the directed edge $\langle (l+1,u),(l+1,u^l) \rangle$ ($\langle (l+1,u),(l+1,u^{d-l-1}) \rangle$) (Figure 1.11). Notice that $H_F$ and $H_R$ map directly onto the switches and links of the hypercube used for forward routing and reverse routing, respectively.

A route in $H_F$ is obtained from a route in F by replacing every edge $\langle (l,u),(l+1,u^l) \rangle$ by the two directed edges $\langle (l,u),(l+1,u) \rangle$ and $\langle (l+1,u),(l+1,u^l) \rangle$, $0 \le l \le d-1$. Similarly, a route in $H_R$ is obtained from a route in R by replacing every edge $\langle (l,u),(l+1,u^{d-l-1}) \rangle$ by the two directed edges $\langle (l,u),(l+1,u) \rangle$ and $\langle (l+1,u),(l+1,u^{d-l-1}) \rangle$, $0 \le l \le d-1$.

Figure 1.11: The $H_F$ and $H_R$ corresponding to F and R of Figure 1.10

Theorem 1.4 [82] Let $R_1$ and $R_2$ be the paths of two vertex disjoint routes in F (R) and $R'_1$ and $R'_2$ be the corresponding paths in $H_F$ ($H_R$). Then $R'_1$ and $R'_2$ are edge disjoint.

Proof: Assume by way of contradiction that an edge $e$ that is common to $R'_1$ and $R'_2$ exists. If $e = \langle (l,u),(l+1,u) \rangle$, then vertex $(l,u)$ is common to both $R_1$ and $R_2$. If $e = \langle (l,u),(l,u') \rangle$, where $u' = u^{l-1}$ or $u^{d-l}$ according to whether $R_1$ and $R_2$ are from F or R, then vertex $(l-1,u)$ is common to both $R_1$ and $R_2$. □

Corollary 1.2 Let $S_1$ be a set of conflict-free routes in F or R and $S_2$ be the corresponding routes in the hypercube. Then, routes in $S_2$ are pairwise link disjoint in the hypercube. □

1.4.3  Conflict-Free  Routing 

We now describe several routing patterns that arise throughout this thesis, and show that they are conflict-free in F or R. As a consequence of Theorem 1.4 and Corollary 1.2, these routings can be performed without link conflict in the hypercube network using LSB or MSB routing.

Lemma 1.1 [82] Let $(l,u)$ be a node on the route from $(0,i)$ to $(d,j)$ in F (R). Then the binary representation of $u$ is $i_{d-1} \ldots i_l j_{l-1} \ldots j_0$ ($j_{d-1} \ldots j_{d-l} i_{d-l-1} \ldots i_0$).

Proof:  Direct  consequence  of  LSB  (MSB)  routing.  □ 

Lemma 1.2 [82] Let $((0,s_i),(d,t_i))$, $0 \le i \le r-1$, be a collection of $r$ pairs such that $0 \le s_0 < s_1 < \ldots < s_{r-1} \le 2^d - 1$, $0 \le t_0 < t_1 < \ldots < t_{r-1} \le 2^d - 1$, and $s_{i+1} - s_i \ge t_{i+1} - t_i$, for all $i$, $0 \le i \le r-2$. Then the set of routes in F from vertex $(0,s_i)$ to the vertex $(d,t_i)$ is conflict-free.

Proof: Let $i$ and $j$ be such that $0 \le j < i \le r-1$. Let $u = s_i$, $v = s_j$, $x = t_i$ and $y = t_j$. Assume by way of contradiction that $(l,w)$ is a vertex that is common to the two routes $(u,x)$ and $(v,y)$. From Lemma 1.1, $w = u_{d-1} \ldots u_l x_{l-1} \ldots x_0 = v_{d-1} \ldots v_l y_{l-1} \ldots y_0$. Thus $u$ and $v$ agree in bits $l$ through $d-1$, so $u - v < 2^l$, while $x$ and $y$ agree in bits 0 through $l-1$, so $x - y \ge 2^l$. This contradicts the fact that $u - v \ge x - y$. Since $i$ and $j$ were arbitrary, the set of routes is conflict-free. □

The special case of this lemma, where $t_i = i$, is known as concentrate routing. A similar lemma holds for the routes in R.
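Lemma 1.1 and Lemma 1.2 can be checked on small instances by tracing LSB routes in F explicitly. The following is a minimal sketch; `lsb_route` and `conflict_free` are illustrative names, and a route is represented as the list of (level, index) vertices it visits.

```python
def lsb_route(d, i, j):
    """Vertices visited in F by a packet sent from (0, i) to address j.

    At level l the switch fixes bit l of the index to bit l of j, so the
    vertex at level l has the high bits of i and the low l bits of j.
    """
    route = [(0, i)]
    u = i
    for l in range(d):
        mask = 1 << l
        u = (u & ~mask) | (j & mask)
        route.append((l + 1, u))
    return route


def conflict_free(routes):
    """True if the given routes are pairwise vertex disjoint."""
    seen = set()
    for route in routes:
        for vertex in route:
            if vertex in seen:
                return False
            seen.add(vertex)
    return True
```

As a usage example, concentrating sources 1, 3, 4, 7 onto destinations 0, 1, 2, 3 with d = 3 gives vertex-disjoint routes, whereas sending sources 0, 1 to destinations 0, 2 violates the gap condition of Lemma 1.2 and the two routes collide at vertex (1, 0).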

Lemma 1.3 [82] Let $((0,s_i),(d,t_i))$, $0 \le i \le r-1$, be a collection of $r$ pairs such that $0 \le s_0 < s_1 < \ldots < s_{r-1} \le 2^d - 1$, $0 \le t_0 < t_1 < \ldots < t_{r-1} \le 2^d - 1$, and $s_{i+1} - s_i \le t_{i+1} - t_i$, for all $i$, $0 \le i \le r-2$. Then the set of routes in R from vertex $(0,s_i)$ to the vertex $(d,t_i)$ is conflict-free. □

The special case of this lemma, where $s_i = i$, is known as spread routing.

Broadcast routing is defined as follows. Let $\{(0,i) \mid 0 \le i \le r-1\}$ be a set of sources in R. Associated with each source $(0,i)$ is a pair of integers, $a_i$ and $b_i$, such that $a_i \le b_i$. Let $a_{i+1} > b_i$, for all $i$, $0 \le i \le r-2$, and $b_{r-1} \le p-1$. Then broadcast routing is to route data from $(0,i)$ to all $(d,u)$, $a_i \le u \le b_i$. Broadcast routing can be performed using MSB routing with copy in R.

Lemma 1.4 [82] Let $(0,i)$ and $(0,j)$ be two sources involved in a broadcast routing. Then the routes from $(0,i)$ to $(d,u)$ and from $(0,j)$ to $(d,v)$, for any $u$ and $v$, $a_i \le u \le b_i$ and $a_j \le v \le b_j$, are conflict-free. □

Corollary  1.3  Broadcast  routing  is  conflict-free.  □ 

Note that all the above conflict-free routings can be performed by using fully normal algorithms. We later prove that routings of this kind can be performed optimally on the pipelined hypercube, since their paths are conflict-free and all the links of the hypercube can be used simultaneously.
Chapter 2


Load  Balancing,  Sorting  and 
Routing 


2.1  Introduction 

Consider  any  of  our  networks  in  the  case  when  the  input  size  n  is  larger  than 
the  number  of  processors  p.  Compared  with  the  PRAM  model,  there  are  two 
main  drawbacks:  (1)  no  two  processors  can  access  the  same  memory  module 
simultaneously  and  hence  memory  conflicts  should  be  avoided,  and  (2)  there 
is an $\Omega(\log p)$ cost for two arbitrary processors to communicate. An efficient
algorithm  should  maintain  a  balance  between  local  computation  and  communi¬ 
cation,  and  should  arrange  the  data  dynamically  in  such  a  way  that  memory 
conflicts  are  avoided.  Such  algorithms  have  been  developed  for  several  basic 
data  broadcasting  and  communication  problems  [31,32,34,35,37,65,66],  numeri¬ 
cal  computing  problems  [31,32,37],  and  sorting  [18,61,82]. 

We  address  in  this  chapter  several  problems  related  to  load  balancing,  sorting 
and  routing  on  the  hypercube,  the  shuffle-exchange,  the  cube-connected  cycles 
and  the  butterfly.  These  problems  are  important  on  their  own  and  are  funda¬ 
mental  to  fast  implementation  of  parallel  algorithms  on  these  networks.  Our 
contribution  is  two-fold.  First,  we  provide  new  algorithms  to  handle  these  prob¬ 
lems.  In  most  of  the  cases,  our  algorithms  are  efficient  under  certain  conditions. 
For example, our algorithm for routing $n$ packets on the $p$-processor hypercube is efficient whenever $n = \Omega(p^{1+\epsilon})$, for some positive constant $\epsilon$. Second, we shed some insight into the relationship between these different networks. For example, we establish that load balancing can provably be solved faster on the weak hypercube than on the shuffle-exchange, the cube-connected cycles or the butterfly.

The  rest  of  the  chapter  is  organized  as  follows.  Some  results  on  basic  commu¬ 
nication  schemes  needed  for  the  rest  of  this  thesis  are  given  in  the  next  section. 
Load  balancing  and  sorting  are  considered  in  sections  2.3  and  2.4  respectively. 
while the algorithms for the general packet routing problem are presented in section 2.5. The last section is devoted to the relationship between our networks and the CRCW PRAM.


2.2  Basic  Communication  Schemes 

Recall that a fully-normal algorithm on the hypercube with $p = 2^d$ processors consists of $d$ stages such that the stage $i$ involves the communication links in the dimension $i$ or in the dimension $(d-i-1)$ of the hypercube, $0 \le i \le d-1$, and proceeds from the least significant bit to the most significant bit (LSB) or vice-versa (MSB), respectively (section 1.3). It turns out that many computational and routing problems, such as Fast Fourier Transform, prefix sums, odd-even and bitonic merge, matrix transpose, conflict-free routing, and any fixed permutation, can be solved optimally by such an algorithm. The existence of such simple optimal algorithms has stimulated initial interest in the hypercube model. In this section, we will review several important cases of the routing problem on the hypercube that can be solved optimally by a fully normal algorithm and develop the necessary background needed for the rest of the thesis. We are interested in the case when the number $n$ of data items could be much larger than the number $p$ of processors.

A simple routing problem consists of a set of $n$ packets, each of which has a source, a destination and a data item to be moved from the source processor to the destination processor. Let $\langle i, t_i, x_i \rangle$ denote an arbitrary packet, where $i$ is the source, $t_i$ is the destination, and $x_i$ is a data item. Suppose we know how to solve a special instance of this routing problem optimally on an $n$-processor hypercube. We are interested in mapping the algorithm into a $p$-processor hypercube. We define the corresponding routing problem on the $p$-processor hypercube as follows. Let $i = (\frac{n}{p}) q_i + r$, where $0 \le r \le \frac{n}{p} - 1$. Similarly define $q_{t_i}$. Then replace packet $\langle i, t_i, x_i \rangle$ by $\langle q_i, q_{t_i}, x_i \rangle$. Each processor is now the source of $\frac{n}{p}$ packets. The following simple observation will have important implications.
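The mapping from the $n$-processor problem to the $p$-processor problem is an integer division of both addresses by the block size $n/p$. A minimal sketch follows; the function name `map_packet` is illustrative, and $p$ is assumed to divide $n$.

```python
def map_packet(i, t, x, n, p):
    """Map packet <i, t, x> of the n-processor problem to <q_i, q_t, x>.

    q_i and q_t are the quotients of i and t by the block size n/p, so
    the n/p consecutive virtual processors i = (n/p)*q + r, 0 <= r < n/p,
    are all hosted by physical processor q.
    """
    block = n // p
    return (i // block, t // block, x)
```

With n = 16 and p = 4, packets from virtual sources 12 through 15 all originate at physical processor 3, so every processor is the source of n/p = 4 packets.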

Lemma 2.1 Suppose that a simple routing problem with $n$ packets can be solved on an $n$-processor hypercube by using a fully-normal algorithm. Then the corresponding problem can be solved in time $O(\frac{n}{p} + \log p)$ on a $p$-processor pipelined hypercube, where $p \le n$.

Proof: Let $n = 2^{d_1}$ and $p = 2^{d_2}$. Without loss of generality, assume that the given routing problem can be solved by using the LSB routing on the $n$-processor hypercube $H$. We will emulate this strategy on the $p$-processor hypercube $H'$. The first $d_1 - d_2$ stages of the algorithm involve data movements within the local memories of $H'$. The $(d_1 - d_2 + 1)$-th stage involves moving possibly $n$ packets along the $(d_1 - d_2 + 1)$-th dimension of $H$. $H'$ will pipeline these requests starting with the first dimension. Processor $P_i$ sends its initial packet to switch $(0,i)$, $0 \le i \le p-1$, in the odd phase of the $(d_1 - d_2 + 1)$-th stage. On receiving the packet, switch $(0,i)$ decodes the destination address associated with the packet, and buffers it for transmission on either the intra-processor link or the inter-processor link to switch $(0,i^0)$. If the packet in switch $(0,i)$ is buffered on the inter-processor link, it will be transferred to switch $(0,i^0)$ and buffered on the intra-processor link to switch $(1,i^0)$ in the even phase. In the odd phase of the $(d_1 - d_2 + 2)$-th stage, the initial packets will be transferred to switches $(1,i)$, $0 \le i \le 2^{d_2} - 1$, while the next $p$ packets will be sent to switches $(0,i)$, $0 \le i \le 2^{d_2} - 1$. At the $d_1$-th stage of $H$, all the links in $H'$ are busy handling the pipelined packets. Since we are allowing pipelining, the routing defined on $H'$ is legal. Therefore the lemma follows. □

The  assumption  of  the  pipelined  communication  indicated  in  the  above 
lemma  is  crucial  as  we  will  show  now.  We  will  describe  a  routing  problem 
which  can  be  solved  optimally  using  a  fully-normal  algorithm,  and  yet  cannot 
be  handled  within  the  time  bound  stated  in  the  above  lemma  on  the  weak  hy¬ 
percube. 

Let $p = 2^{2d}$, for some positive integer $d$. Let $E(i) = (i + 0101 \ldots 01_2) \bmod p$, $0 \le i \le p-1$. For example, if $p = 2^2$, then $E(00) = 01$, $E(01) = 10$, $E(10) = 11$, and $E(11) = 00$.

Remark  2.1  The  Hamming  distance  between  i  and  E{i)  is  no  less  than  d,  for 
any  i,  0  <  i  <  p  —  1. 

Proof;  The  claim  is  obvious  if  d  =  1.  Assume  that  d  >  1.  One  can  easily  check 
that  i  and  E{i)  differ  in  at  least  one  bit  position  of  the  two  most  significant  bits. 
The  claim  follows  by  induction.  □ 
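Remark 2.1 can be verified exhaustively for small $d$. The sketch below builds the constant $0101 \ldots 01_2$ and checks the Hamming distance; the function names are illustrative. The intuition is that each pair of address bits receives 01 plus a possible carry, that is, 1 or 2 modulo 4, so at least one bit of every pair changes.

```python
def E(i, d):
    """E(i) = (i + 0101...01_2) mod p for p = 2**(2*d)."""
    p = 1 << (2 * d)
    pattern = sum(1 << (2 * k) for k in range(d))  # d copies of the bit pair 01
    return (i + pattern) % p


def hamming_distance(i, k):
    """Number of bit positions in which i and k differ."""
    return bin(i ^ k).count("1")
```

For p = 4 (d = 1) the function reproduces the example in the text: 00 maps to 01, 01 to 10, 10 to 11, and 11 to 00.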

Lemma 2.2 Consider the routing problem on a $p$-processor hypercube, where processor $P_i$ has to send $\frac{n}{p}$ data items to processor $P_{E(i)}$, for $0 \le i \le (1010 \ldots 10_2)$. Then this problem can be solved in time $O(\frac{n}{p} + \log p)$ on the pipelined hypercube by using a fully-normal algorithm. However, on the weak hypercube, it requires $\Omega(\frac{n}{p} \log p)$ time.

Proof: The paths that send items from $P_i$ to $P_{E(i)}$, $0 \le i \le (1010 \ldots 10_2)$, are conflict-free by Lemma 1.2, and the problem can be solved by a fully-normal algorithm. Thus, by Lemma 2.1, it can be solved in time $O(\frac{n}{p} + \log p)$ on the pipelined hypercube. However, the total number of data movements is $\Omega(n \log p)$ since, by the above remark, each item must pass $\Omega(\log p)$ communication links to get to its proper destination. Thus, $\Omega(\frac{n}{p} \log p)$ steps are necessary on the weak hypercube since only one link in each processor can be used at each time step. □




A  routing  problem  of  n  packets  on  a  p-processor  hypercube  will  be  viewed  as 
a  routing  problem  on  an  n-processor  hypercube.  A  fully-normal  algorithm  will 
be  found,  and  then  using  Lemma  2.1,  a  solution  on  the  p-processor  hypercube 
will  be  obtained. 

The four important special routing problems, concentrate, broadcast, spread and collect, can be restated as follows. Assume that each processor $P_{s_i}$ has a block $B_i$, $0 \le i \le r-1$, where $s_0 < s_1 < \ldots < s_{r-1}$ and $|B_i| = t$.

The  concentrate  routing  consists  of  sending  the  block  in  P,,  to  P,  0  < 
2  <  r  —  1.  By  Lemma  1.2,  the  concentrate  routing  can  be  performed  by  using 
a  fully-normal  algorithm  on  a  u-processor  hypercube.  Therefore  by  Lemma  2.1, 
the  concentrate  routing  can  be  solved  in  0{t  logp)  time  on  the  p-processor 
pipelined  hypercube. 

The  broadcast  routing  can  be  redefined  as  follows.  Each  processor  P,,  0  < 
2  <  r  —  1,  has  to  broadcast  its  block  to  all  processors  Pj,  for  a,  <  j  <  b,,  where 
o-i  <  bt  <  ai+\  <  and  6r_i  <  p—  1.  Again  using  Lemma  1.4  and  Lemma  2.1. 
we  conclude  that  the  broadcast  routing  can  be  performed  in  time  0(t  -(-  logp) 
on  the  pipelined  hypercube. 

The  spread  routing  can  be  similarly  redefined  as  follows.  Each  processor  P;, 
0  <  2  <  r  —  1,  has  to  spread  its  block  among  processors  P^,  for  a,  <  j  <  bi,  with 
Pj  receiving  tj  data  items  after  spreading,  where  Yl\'=a,  =  t,  ai  ^  bi  <  a,+i  < 

and  br-i  <  p—  1.  This  is  similar  to  the  broadccist  routing  and  can  be  solved 
in  0(<  -f  logp)  time  on  the  pipelined  hypercube. 

The  collect  routing  is  the  inverse  of  the  spread  routing.  Each  processor  P,, 
0  <  2  <  r  —  1,  has  to  collect  all  the  data  items  from  the  processors  Pj,  for 
a.  <  j  <  bi,  with  Pj  having  tj  data  items  before  collecting,  where  =  t, 

^  bi  <  a,+i  <  bi+i  and  br-i  <  p  —  1.  This  can  be  solved  in  0{t  -I-  logp)  time 
on  the  pipelined  hypercube. 

Finally, a routing problem that can be also solved optimally on the pipelined
hypercube can be defined by the set $\{\langle s_i, t_i \rangle\}$, $0 \le i \le r-1$, where the
block $B_i$ in $P_{s_i}$ has to be moved to processor $P_{t_i}$, and where $\{s_i\}$ and $\{t_i\}$ are
strictly increasing sequences. Clearly, this routing can be solved by a combination
of concentrate and broadcast. Notice that the routing problem introduced
in Lemma 2.2 is of this type.
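To make the conflict-free property of the concentrate routing concrete, the following is a minimal sketch in Python. It assumes that a concentrate is routed by bit-fixing, correcting address bits from dimension 0 upward; the routing order used by Lemma 1.2 is not restated in this section, so that order and the function name `concentrate_is_conflict_free` are assumptions of the sketch rather than the thesis's own formulation.

```python
def concentrate_is_conflict_free(sources, d):
    """Simulate bit-fixing (dimension 0 first) for a concentrate on a d-cube.

    sources must be strictly increasing; packet i travels from sources[i]
    to processor i. Returns True if no two packets ever share a node after
    a stage and every packet ends at its destination.
    """
    positions = list(sources)
    for bit in range(d):                      # dimensions 0, 1, ..., d-1
        mask = 1 << bit
        for i, pos in enumerate(positions):
            if (pos ^ i) & mask:              # bit differs from destination i
                positions[i] = pos ^ mask     # cross the link of this dimension
        if len(set(positions)) != len(positions):
            return False                      # two packets met in one node
    return positions == list(range(len(sources)))
```

Since each stage uses links of a single dimension and no node ever holds two packets, the items of each block can follow one another along the same path, which is the situation in which pipelining applies.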

The class of block permutation can be defined as follows. Let $\pi$ be a
permutation of $\{0, 1, \ldots, p-1\}$. The goal is to move block $B_i$ of processor $P_i$ to
processor $P_{\pi(i)}$, $0 \le i < p$. We will provide a solution to this problem based on
permutation networks. We will briefly review some of the basic facts needed.

A Benes permutation network (Figure 2.1(a)) can be used to realize any
permutation on the input. For any given permutation $\pi$, the switches of the
Benes permutation network realizing $\pi$ can be set in $O(p \log p)$ sequential time
[86], or in $O(\log^4 p)$ time on a $p$-processor hypercube, shuffle-exchange,
cube-connected cycles or butterfly [50,57,71].

Any  Benes  permutation  network  can  be  emulated  on  the  weak  hypercube, 
the  shuffle-exchange  and  the  cube-connected  cycles.  However,  emulating  the 
permutation  network  on  the  pipelined  hypercube  is  not  appropriate,  since  it 
would  use  communication  links  in  different  dimensions  in  each  stage  and  this 
makes  pipelining  impossible.  So  we  need  another  permutation  network  which 
can  be  naturally  related  to  the  hypercube. 

A butterfly permutation network, defined recursively in Figure 2.1(b), can
also be used to realize any permutation on the input. For any given permutation
$\pi$, the switches of the network realizing $\pi$ can be similarly set in $O(p \log p)$
sequential time, or in $O(\log^4 p)$ time on a $p$-processor hypercube, shuffle-exchange,
cube-connected cycles or butterfly. There is an obvious connection between the
butterfly permutation network and the hypercube. Moreover, pipelining data on
the butterfly permutation network can be emulated efficiently on the pipelined
hypercube since links used at any stage correspond to communication links in
only one dimension.

Lemma 2.3 An arbitrary block permutation on $n$ elements can be performed in
time $O(\frac{n}{p} + \log^4 p)$ on the pipelined hypercube.

Proof: We can find the paths needed for the given permutation $\pi$ in time
$O(\log^4 p)$. Then an MSB routing followed by an LSB routing that will fully
pipeline the elements of all the blocks will be used. Notice that no conflicts
will arise because communication links in different dimensions correspond to
different stages of the butterfly algorithm. □

Using  the  above  facts,  we  will  show  the  following  result  which  will  be  used 
heavily  in  obtaining  the  upper  bounds  on  the  pipelined  hypercube  model. 

Theorem 2.1 Consider a $p$-processor hypercube in which each processor $P_i$ holds
a block of data $D_i$ of size $\frac{n}{p}$, $0 \le i \le p-1$. Let
$\alpha : \{0, 1, \ldots, p-1\} \to \{0, 1, \ldots, p-1\}$ be a partial function. Suppose it is
desired to move block $D_{\alpha(i)}$ to processor $P_i$, whenever $\alpha(i)$ is defined. Then
this can be done in $O(\frac{n}{p} + \log^4 p)$ time on the pipelined hypercube model.

Proof: Each processor $P_i$ creates a record $\langle i, \alpha(i) \rangle$ if $\alpha(i)$ is defined, and
a record $\langle i, \infty \rangle$ otherwise. All the $p$ records are sorted by their second
components. Each set of records with the same second component will be in
consecutive processors after sorting. For each such set, we mark a record residing
in the lowest indexed processor as the representative record of the set. Note
that the representative records construct a one-to-one partial function. Assume
that processors $P_{j_1}, \ldots, P_{j_k}$ have the representative records
$\langle i_1, \alpha(i_1) \rangle, \langle i_2, \alpha(i_2) \rangle, \ldots, \langle i_k, \alpha(i_k) \rangle$ respectively. Then we
construct a permutation $\pi$ such that $\pi(\alpha(i_l)) = j_l$, $1 \le l \le k$, and send
$D_{\alpha(i_l)}$ to $P_{j_l}$ by the block permutation operation. We now broadcast $D_{\alpha(i_l)}$ to
the consecutive processors with records $\langle *, \alpha(i_l) \rangle$. Note that any processor
with the record $\langle i, \alpha(i) \rangle$ has the block $D_{\alpha(i)}$ and that sending $D_{\alpha(i)}$ to $P_i$
can be done similarly. □

Figure 2.1: A Benes permutation network (a) and a butterfly permutation network (b)

We now consider the case of fixed permutations, for example, the transpose
of a matrix, the perfect shuffle, etc. In this case, the routing paths can be
predetermined, and in particular, we have the following lemma.

Lemma 2.4 Given $m$ fixed permutations on $p$ elements, these permutations can
be realized in $O(m + \log p)$ time on the pipelined hypercube. □

Using Lemma 2.1 and Lemma 2.4, many of the routing problems considered
in [31,32,37] can be solved optimally. As a matter of fact, we have developed
much simpler algorithms based on the above method than those reported in
[31,32,37]. We will illustrate this with an example.

The all-to-all personalized communication can be defined as follows [32].
Processor $P_i$ has $p$ blocks $B_{i,0}, B_{i,1}, \ldots, B_{i,p-1}$, each of the same size $t$, and
$B_{i,j}$ is supposed to be moved to $P_j$, $0 \le i, j \le p-1$.

Lemma 2.5 The all-to-all personalized communication problem can be solved
in time $O(tp + \log p)$ on the pipelined hypercube.

Proof: The procedure can be divided into $2p-1$ steps. In the $i$th step,
$0 \le i \le p-1$, $P_j$ sends block $B_{j,i+j}$ to $P_{i+j}$, $0 \le j < p-i$. For
$p \le i \le 2p-2$, processor $P_j$ sends $B_{j,j-(2p-1-i)}$ to $P_{j-(2p-1-i)}$,
$2p-1-i \le j < p$. The various steps can
be fully pipelined, and the proof of the lemma follows. □
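The schedule in the proof can be checked mechanically. The sketch below, in Python, enumerates the $2p-1$ steps as lists of (source, destination) processor pairs; the function name `all_to_all_schedule` is hypothetical and the code only models which block moves in which step, not the pipelined transmission itself.

```python
def all_to_all_schedule(p):
    """Return the 2p-1 steps of the schedule as lists of (source, destination)."""
    steps = []
    for i in range(p):                       # steps 0 .. p-1: P_j -> P_{j+i}
        steps.append([(j, j + i) for j in range(p - i)])
    for i in range(p, 2 * p - 1):            # steps p .. 2p-2: P_j -> P_{j-s}
        s = 2 * p - 1 - i                    # shift amount s = 2p-1-i
        steps.append([(j, j - s) for j in range(s, p)])
    return steps
```

Each step is a shift by a fixed amount, so within a step the sources and destinations are both strictly increasing, and every ordered pair of processors appears in exactly one step.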

Finally, we introduce the following permutation problem which will be used
in the next section. Consider a square matrix $A$ of size $n \times n$ stored in a
hypercube of dimension $2d$, $n \ge 2^d$. Generalization to non-square matrices is
straightforward. There are several ways of distributing the matrix elements
among the different processors of the hypercube. We mention here two schemes
of interest. In consecutive storage, $A$ is decomposed into subarrays of equal
sizes and each subarray is stored in a processor. This means that all elements
$(i,j) \in \{0, 1, \ldots, n-1\} \times \{0, 1, \ldots, n-1\}$ of the $n \times n$ array $A$ that satisfy the
relations $r = \lfloor i 2^d / n \rfloor$, $s = \lfloor j 2^d / n \rfloor$ are identified with element
$(r,s) \in \{0, 1, \ldots, 2^d-1\} \times \{0, 1, \ldots, 2^d-1\}$ of the $2^d \times 2^d$ array $A'$, which can
be embedded in a $2d$-dimensional hypercube. In cyclic storage, all elements $(i,j)$
of $A$ that satisfy the relations $r = i \bmod 2^d$, $s = j \bmod 2^d$ are identified with
element $(r,s)$ of the array $A'$. The consecutive and cyclic storage schemes are
illustrated in Figure 2.2. Using the observations above, it is clear that the
conversion from consecutive storage to cyclic storage (and vice versa) can be
performed in time $O(\frac{n^2}{p} + \log p)$ on the pipelined hypercube.
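The two storage schemes can be sketched as index maps from a matrix entry to the processor of the $2^d \times 2^d$ array that owns it. This is only an illustration in Python: the helper names are hypothetical, and the floor-based formula for consecutive storage is an assumed reading of "subarrays of equal sizes".

```python
def consecutive_owner(i, j, n, d):
    """Owner (r, s) of entry (i, j) under consecutive (block) storage.

    Assumes the block formula r = floor(i * 2^d / n), s = floor(j * 2^d / n).
    """
    side = 1 << d
    return (i * side // n, j * side // n)


def cyclic_owner(i, j, d):
    """Owner (r, s) of entry (i, j) under cyclic storage: indices mod 2^d."""
    side = 1 << d
    return (i % side, j % side)
```

For example, with $n = 8$ and $d = 2$, entry $(5, 2)$ belongs to processor $(2, 1)$ under consecutive storage and to processor $(1, 2)$ under cyclic storage. The conversion between the two schemes is a fixed permutation of the entries, which is why the routing paths can be predetermined.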
2.3 Load Balancing

Balancing  load  among  processors  is  very  important  since  poor  balance  of  load 
generally  causes  poor  processor  utilization.  The  load  balancing  problem  is  a 
fundamental  problem  in  the  sense  that  the  fast  solutions  of  basic  problems  such 
as  sorting,  selection,  list  ranking,  graph  problems,  and  routing  require  fast  load 
balancing  [34,35,58,61,64].  In  this  section,  some  lower  bounds  and  tight  upper 
bounds  for  load  balancing  on  the  hypercube,  the  shuffle-exchange,  the  cube- 
connected  cycles  and  the  butterfly  are  shown. 

The load balancing problem is defined as follows. Let $n$ items be distributed
over the $p$ processors of a network, with no more than $M$ items assigned to any
single processor, $\lceil n/p \rceil \le M \le n$. The problem is to redistribute the items so
that the number of items in any two processors may differ by at most one. It
is  irrelevant  where  a  data  item  is  routed  to.  This  problem  can  also  be  solved  in 
0{M  +  log p  •  min(log  log  logp))  time  on  a  bounded-degree  network  based  on 
expander graphs [58], and in $O(M\sqrt{\log p} + \log^2 p)$ time on the weak hypercube
[61]. We start by addressing the case of the pipelined hypercube.

Assume that processors $P_0, P_1, \ldots, P_{p-1}$ of the pipelined hypercube have
$n_0, n_1, \ldots, n_{p-1}$ data items respectively, and that $n_i \le M$, $0 \le i \le p-1$.
Without loss of generality, we can assume that $\frac{n}{p} = a$ is an integer. The
basic idea of our algorithm is to make each processor $P_i$ decide where to move
its data items based on $\sum_{j=0}^{i-1} n_j$ and on $\sum_{j=0}^{i} n_j$, with the goal of balancing as
many processors as possible starting from the lowest indexed processor. In other
words, each processor $P_i$ computes $l_i$ and $r_i$ such that
$l_0 \le r_0 \le l_1 \le r_1 \le \cdots \le l_{p-1} \le r_{p-1}$, and sends its data elements to processors
$P_{l_i}, P_{l_i+1}, \ldots, P_{r_i}$. More precisely, define $l_i$ and $r_i$ to be integers such that
$l_i \cdot a \le \sum_{j=0}^{i-1} n_j < (l_i + 1)a$ and $r_i \cdot a < \sum_{j=0}^{i} n_j \le (r_i + 1)a$, respectively.
Notice that $l_i$ and $r_i$ can be 0 and that $l_i \le r_i$. Then $P_i$ distributes its data
items over $P_{l_i}, P_{l_i+1}, \ldots, P_{r_i}$, if $n_i > 0$. If $l_i < r_i$, then $P_{l_i}$ and $P_{r_i}$ will receive
$(l_i + 1)a - \sum_{j=0}^{i-1} n_j$ and $\sum_{j=0}^{i} n_j - r_i \cdot a$ data items from $P_i$, respectively.
If $r_i > l_i + 1$, $P_{l_i+1}, \ldots, P_{r_i-1}$ will each receive $a$ data items. $P_i$ will send
its $n_i > 0$ data items to $P_{l_i}$ in the case when $l_i = r_i$.

procedure BALANCE:

[B1] For each $i$, $0 \le i \le p-1$, compute $l_i$, $r_i$ and the destination address of
each data item.

[B2] Let $P_{i_0}, P_{i_1}, \ldots, P_{i_k}$ be all the processors such that $l_{i_j} < r_{i_j}$, $0 \le j \le k$.
Then $P_{i_j}$ distributes the appropriate data items over $P_{l_{i_j}}, \ldots, P_{r_{i_j}-1}$. This
step will be executed in two substeps. In the first substep, $P_{i_j}$ sends the
appropriate data items to $P_j$ by using the concentrate operation. In the
second substep, $P_j$ distributes the received elements to $P_{l_{i_j}}, \ldots, P_{r_{i_j}-1}$ by
using the spread operation.

[B3] After step [B2], each processor can only send its data items to a single
processor. But each processor can receive data items from more than one
processor. Assume that $P_{i_0}, \ldots, P_{i_k}$ are all the processors that will send
their data items to $P_i$ and that $P_{i_j}$ has to send $m_j$ data items, $0 \le j \le k$.
Clearly, $\sum_{j=0}^{k} m_j \le a$. Also assume that $P_i$ is the $x_i$-th processor
among all the processors that will receive data items. Then in the first
substep, $P_{i_0}, P_{i_1}, \ldots, P_{i_k}$ send their appropriate data items to processor $P_{x_i}$,
one at a time by using the collect routing. In the second substep, $P_{x_i}$ will
send the collected data items to $P_i$ by using the broadcast operation.

The step by step implementation of BALANCE is illustrated in Figure 2.3.
Figure 2.3(a) shows that $l_0 = 0$, $r_0 = 2$, $l_1 = r_1 = 2$, $l_2 = r_2 = 2$, $l_3 = r_3 = 2$,
$l_4 = 2$, $r_4 = 4$, $l_5 = 4$, $r_5 = 6$, $l_6 = r_6 = 6$, $l_7 = 6$ and $r_7 = 7$. In step
[B2], $P_0$ distributes its appropriate data items over $P_0$ and $P_1$, $P_4$ distributes
its appropriate data items over $P_2$ and $P_3$, $P_5$ distributes its appropriate data
items over $P_4$ and $P_5$, and $P_7$ sends its appropriate data items to $P_6$. This step
is illustrated in Figure 2.3(b). In step [B3], $P_2$ receives the appropriate data
items from $P_0$, $P_1$, $P_2$ and $P_3$, $P_4$ receives data items from $P_4$, $P_6$ receives data
items from $P_5$ and $P_6$, and $P_7$ receives data items from $P_7$. Notice that $x_2 = 0$,
$x_4 = 1$, $x_6 = 2$, $x_7 = 3$. This step is illustrated in Figure 2.3(c).
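The computation of step [B1] can be sketched sequentially as follows. This is a minimal Python illustration, not the parallel implementation: the function name `balance_plan` is hypothetical, the prefix sums would be obtained by a parallel prefix computation on the hypercube, and the treatment of empty processors (setting $l_i = r_i$ and sending nothing) is a convention assumed here.

```python
from itertools import accumulate


def balance_plan(counts):
    """counts[i] = n_i. Returns (l, r, sends); sends[i] maps target -> #items."""
    p = len(counts)
    total = sum(counts)
    if total == 0 or total % p != 0:
        raise ValueError("sketch assumes a = n/p is a positive integer")
    a = total // p
    prefix = [0] + list(accumulate(counts))   # prefix[i] = n_0 + ... + n_{i-1}
    l, r, sends = [], [], []
    for i in range(p):
        before, after = prefix[i], prefix[i + 1]
        if counts[i] == 0:
            # Assumed convention: an empty processor sends nothing.
            li = ri = min(before // a, p - 1)
            plan = {}
        else:
            li = before // a                  # l_i * a <= before < (l_i + 1) * a
            ri = (after - 1) // a             # r_i * a <  after  <= (r_i + 1) * a
            # Items with global ranks before .. after-1 go to target rank // a.
            plan = {t: min(after, (t + 1) * a) - max(before, t * a)
                    for t in range(li, ri + 1)}
        l.append(li)
        r.append(ri)
        sends.append(plan)
    return l, r, sends
```

With the hypothetical load vector `[17, 1, 1, 1, 15, 15, 2, 12]` (so $a = 8$), the sketch yields the same $l_i$ and $r_i$ values as those listed for Figure 2.3(a), and every processor ends up receiving exactly $a$ items.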

Using  the  facts  shown  in  the  previous  section,  it  is  easy  to  show  the  following 
theorem. 

Theorem 2.2 If each processor has a maximum of $M$ data items, then load
balancing can be achieved in time $O(M + \log p)$ on the pipelined hypercube. □

Notice that this algorithm is optimal, and faster than the considerably
more involved algorithm of [58]. However, our model is not of bounded degree.
We now consider the problem on the weak hypercube, the shuffle-exchange, the
cube-connected cycles and the butterfly. The following two lemmas are from
[61].

Lemma  2.6  The  load  balancing  problem  requires  ^l[k^^^M)  time  on  the  weak 
hypercube  when  M  =  0(^p*'^^)  and  M  >  □ 

Lemma 2.7 The load balancing problem can be solved in time
$O(M\sqrt{\log p} + \log^2 p)$ on the weak hypercube. □

Notice  that  the  lower  bound  of  Lemma  2.6  can  be  rewritten  in  the  form 
n(  If  =  O(^),  then  this  bound  reduces  to  Q(.M\/\og  p),  and  hence 

the upper bound of Lemma 2.7 is tight. We now establish similar results for the
shuffle-exchange and the cube-connected cycles.


Figure 2.3: Step by step implementation of BALANCE, panels (a), (b) and (c)
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Theorem  2.3  The  load  balancing  problem  requires  n( "  -)-  M  ^ )  time 

on  a  cube-connected  cycles,  a  shuffle-exchange  or  a  butterfly  with  p  processors, 
when  M  >  — . 

—  p 

Proof:  We  establish  the  proof  for  the  shuffle-exchange  network.  The  proof  for 
the  cube-connected  cycles  and  the  butterfly  is  similar.  The  main  idea  of  the 
proof  is  to  pack  the  items  within  a  suitably  chosen  subset  of  the  processors  such 
that  the  number  of  links  connecting  this  subset  to  the  rest  of  the  processors  is 
small. The two terms in the lower bound correspond to two different choices of
the subset. Let $p = 2^d$.

To  get  the  first  term,  let  M  <  2J2££.  Using  the  characterization  given  in  [.33] 
for  bisecting  the  shuffle-exchange  graph,  it  is  easy  to  see  that  we  can  partition 
the  verte.x  set  V  into  two  subsets  Vi  and  V2  such  that  |Ui|  =  and  only 

$O(p/\log p)$ edges (shuffle) connect $V_1$ and $V_2$. Pack all the $n$ elements in $V_1$. At
least half of these elements have to cross the connecting edges to $V_2$. Hence the
first term of the lower bound follows.

To  obtain  the  second  term,  let  Vi  be  the  set  of  all  vertices  whose  binary 
representations  contain  at  most  r  ones,  r  a  positive  integei  <  j.  Clearly  |V'']|  = 
It  is  clear  that  the  number  of  edges  connecting  V)  to  the  rest  of  the 

nodes  is  bounded  by  Cf  Pack  all  the  elments  in  Thus  — )  steps 

are required to send at least half of the elements in $V_1$ from $V_1$ to $V_2$. Using an
approximation shown in [61], we obtain the second term of the lower bound. □


Notice that the first term in the above lower bound is dominant whenever
$M = o(\frac{n}{p}\sqrt{\log p \log\log p})$. In particular, if $M = O(\frac{n}{p})$, we obtain that load
balancing requires $\Omega(M \log p)$ time on the shuffle-exchange network, the
cube-connected cycles or the butterfly. In view of Lemma 2.7, we conclude that the
weak hypercube is strictly more powerful than the bounded-degree networks in
load balancing. Notice that these bounded-degree networks can simulate normal
hypercube algorithms with only constant slowdown, but the algorithm of [61] is
not even a leveled algorithm.

The following theorem can be obtained by implementing algorithm
BALANCE on the corresponding networks.


Theorem 2.4 The load balancing problem can be solved on a $p$-processor
shuffle-exchange, cube-connected cycles or butterfly in time $O(M \log p + \log^2 p)$. □


Notice  that  the  bound  of  the  above  theorem  is  tight  in  the  sense  that  it  is 
optimal  whenever  A/  =  0{^)  and  p  < 
2.4 Sorting

Sorting is a fundamental computational problem that has been investigated
for several decades. Several efficient parallel sorting algorithms on the PRAM
model [12,24] and on the network model [1,18,39,47,55,61,64,78,82] have been
developed. In this section, we introduce two sorting algorithms, mergesort and
columnsort, that are efficient on the hypercube under some conditions described
below.

2.4.1  Mergesort 

Let $W[0 \ldots n-1]$ be an array of $n$ data items such that subarray
$W_i = W[\frac{in}{p} \ldots \frac{(i+1)n}{p} - 1]$ is stored in processor $P_i$, $0 \le i < p$, of a hypercube.
Then the mergesort algorithm can be described as follows.

procedure MERGESORT:

[S1] Each processor independently sorts the subarray stored in it.

[S2] For $j = 1$ to $\log p$
     merge in each subcube of size $2^j$.
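The control structure of MERGESORT can be illustrated with a short sequential simulation in Python. This is only a sketch under stated assumptions: `hypercube_mergesort` is a hypothetical name, each list in `blocks` stands for the subarray of one processor, all blocks have the same length, and the standard library's `heapq.merge` is used as a stand-in for the parallel merge performed inside each subcube.

```python
import heapq


def hypercube_mergesort(blocks):
    """blocks[i] is the subarray of P_i; len(blocks) must be a power of two."""
    p = len(blocks)
    assert p > 0 and p & (p - 1) == 0, "number of processors must be a power of two"
    size = len(blocks[0])
    assert all(len(b) == size for b in blocks), "blocks must have equal length"
    blocks = [sorted(b) for b in blocks]              # [S1] local sorts
    width = 2
    while width <= p:                                 # [S2] subcubes of size 2^j
        half = width // 2
        for base in range(0, p, width):
            left = [x for b in blocks[base:base + half] for x in b]
            right = [x for b in blocks[base + half:base + width] for x in b]
            merged = list(heapq.merge(left, right))   # stand-in for the subcube merge
            for k in range(width):                    # redistribute evenly
                blocks[base + k] = merged[k * size:(k + 1) * size]
        width *= 2
    return blocks
```

After the last round the concatenation of the blocks, in processor order, is the sorted array, and each processor again holds $\frac{n}{p}$ items.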

We now describe the merge operation of step [S2] that utilizes a technique
similar to that of the BALANCE algorithm. This algorithm is essentially from
[82]. Let $A[0 \ldots \frac{n}{2}-1]$ and $B[0 \ldots \frac{n}{2}-1]$ be two sorted arrays of elements
to be merged. The following simple algorithm merges the two arrays in time
$O(\frac{n}{p} + \log p)$ on the pipelined hypercube. We assume that subarray
$A_i = A[\frac{in}{p} \ldots \frac{(i+1)n}{p} - 1]$ is stored in processor $P_i$ and subarray
$B_j = B[\frac{jn}{p} \ldots \frac{(j+1)n}{p} - 1]$ is stored in processor $P_{\frac{p}{2}+j}$, $0 \le i, j < \frac{p}{2}$.

procedure MERGE:

[M1] Let $a_i$ and $b_j$ be the minimum elements of subarrays $A_i$ and $B_j$ respectively,
$0 \le i, j < \frac{p}{2}$. Merge the $a_i$'s and $b_j$'s using the odd-even or bitonic merge
algorithm.

[M2] If $a_i$ is in $P_k$ after the merge, then move subarray $A_i$ from $P_i$ to $P_k$. Similarly,
if $b_j$ is in $P_k$ after the merge, then move subarray $B_j$ from $P_{\frac{p}{2}+j}$ to $P_k$.

[M3] Find every $i$ such that $P_i$ has a subarray from $A$ and $P_{i+1}$ has a subarray
from $B$, or vice versa.

[M4] Let $t$ be such that $P_t$ has subarray $A_t$, and $P_{t+1}, \ldots, P_{t+l}$ have subarrays
$B_{t+1}, \ldots, B_{t+l}$ respectively, where $l$ is the maximal such number. All the
other cases can be treated similarly and simultaneously. Then broadcast
$A_t$ from $P_t$ to $P_{t+1}$ through $P_{t+l}$.

[M5] In $P_{t+j}$, remove all the elements of $A_t$ that are not greater than $b_{t+j}$ or
greater than $b_{t+j+1}$, $1 \le j < l$. In $P_{t+l}$, remove all the elements of $A_t$ that
are not greater than $b_{t+l}$. And in $P_t$, remove all the elements of $A_t$ that
are greater than $b_{t+1}$. Let $n_{t,j}$ be the number of elements of $A_t$ remaining
in $P_{t+j}$, $0 \le j \le l$. Clearly, $\sum_{j=0}^{l} n_{t,j} = \frac{n}{p}$.

[M6] Merge in each processor.

[M7] Processor $P_{t+j}$ sends its least $\sum_{k=j}^{l} n_{t,k}$ elements to $P_{t+j-1}$, $1 \le j \le l$.

Clearly, step [M1] can be performed in time $O(\log p)$. Step [M3] can be also
performed by using Concentrate routing. Steps [M2] and [M4] can be performed
in time $O(\frac{n}{p} + \log p)$ by using Spread and Broadcast routing, respectively. Steps
[M5] and [M6] are local operations, and can be done in time $O(\frac{n}{p})$. Step [M7]
can be done in time $O(\frac{n}{p} + \log p)$ by using the strategy of steps [B2] and [B3] of
algorithm BALANCE.

Lemma 2.8 [82] Given two sorted arrays, each with $\frac{n}{2}$ elements, procedure
MERGE merges the two arrays in time $O(\frac{n}{p} + \log p)$ on the pipelined hypercube.
□

Corollary  2.1  Given  two  sorted  arrays,  each  with  ~  elements,  procedure 
MERGE  merges  the  two  arrays  in  time  on  the  weak  hypercube,  the 

shuffle-exchange  the  cube-connected  cycles  and  the  butterfly.  □ 

Step [S1] of algorithm MERGESORT can be performed in time $O(\frac{n}{p}\log\frac{n}{p})$
and step [S2], which consists of $\log p$ merges, can be performed in time
$O(\frac{n}{p}\log p + \log^2 p)$. Thus, the total time
for algorithm MERGESORT is $O(\frac{n}{p}\log n + \log^2 p)$ on the pipelined hypercube.

Theorem 2.5 [82] If $n$ elements are stored in $p$ processors evenly, the elements
can be sorted in time $O(\frac{n}{p}\log n + \log^2 p)$ on the pipelined hypercube. □

Corollary  2.2  If  n  elements  are  stored  in  p  processors  evenly,  the  elements  can 
be  sorted  in  time  //je  weak  hypercube,  the  shuffle-exchange,  the 

cube-connected  cycles  and  the  butterfly.  □ 

2.4.2  Columnsort 

In many applications, we need an algorithm to sort small numbers faster. In
this subsection, we will show how to sort $n = p^{1+\epsilon}$ numbers, each between 0
and $p^{O(1)}$, in time $O(\frac{n}{p})$, for any positive constant $\epsilon$ on the pipelined hypercube

by using the columnsort algorithm of Leighton [17]. Clearly the algorithm can
be implemented in time $O(\frac{n}{p}\log p)$ on the weak hypercube. Han used the same
technique to implement the columnsort algorithm on the complete network [27].
We now describe the columnsort algorithm.

Figure 2.4: A 6 x 3 matrix before and after sorting

Figure 2.5: The transpose and its inverse

The columnsort algorithm is a generalization of odd-even merge sort and
will be described as a series of elementary matrix operations. Let $A$ be an
$r \times s$ matrix of numbers where $rs = n$, $r$ is divisible by $s$, and $r \ge 2(s-1)^2$.
Initially, each entry of the matrix is one of the $n$ numbers to be sorted. After the
completion of the algorithm, $A$ will be sorted in column major order form. For
example, Figure 2.4 illustrates a typical matrix before and after sorting. Notice
that this matrix does not satisfy $r \ge 2(s-1)^2$.

The columnsort algorithm has eight steps. In steps 1, 3, 5 and 7, the numbers
within each column are sorted. We use the radix sort algorithm here. In steps
2, 4, 6 and 8, the entries of the matrix are permuted. The permutation in
step 2 (shown for a $6 \times 3$ matrix in Figure 2.5) corresponds to a "transpose"
of the matrix. The permutation in step 4 is the inverse of that in step 2. The
permutation in step 6 corresponds to an $\lfloor r/2 \rfloor$ shift of the entries, and is shown
for a $6 \times 3$ matrix in Figure 2.6. The permutation in step 8 is the inverse of
that in step 6. The step by step implementation of the columnsort algorithm is
shown in Figure 2.7.

Figure 2.6: The shift and its inverse
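The eight steps can be written down as a short sequential sketch in Python. It is an illustration only: the matrix is kept as a flat list in column major order, Python's built-in `sorted` stands in for the radix sort used on the columns, infinite sentinels pad the shifted matrix of step 6, and the function name `columnsort` and its argument layout are assumptions of the sketch.

```python
def columnsort(values, r, s):
    """Sort r*s values held as an r x s matrix in column major order."""
    assert len(values) == r * s and r % s == 0 and r >= 2 * (s - 1) ** 2

    def sort_columns(flat, ncols):
        # Sort each length-r column of a column-major flat list.
        return [x for c in range(ncols) for x in sorted(flat[c * r:(c + 1) * r])]

    flat = sort_columns(list(values), s)                   # step 1
    # Step 2 ("transpose"): pick up in column major order, lay down row by row.
    flat = [x for c in range(s) for x in flat[c::s]]
    flat = sort_columns(flat, s)                           # step 3
    # Step 4: inverse of step 2 (pick up row by row, lay down column by column).
    flat = [x for row in range(r) for x in flat[row::r]]
    flat = sort_columns(flat, s)                           # step 5
    # Step 6: shift by floor(r/2), padding to s + 1 columns with sentinels.
    shift = r // 2
    padded = [float("-inf")] * shift + flat + [float("inf")] * (r - shift)
    padded = sort_columns(padded, s + 1)                   # step 7
    return padded[shift:shift + r * s]                     # step 8: undo the shift
```

For instance, `columnsort(vals, 9, 3)` sorts 27 values, since $9 \ge 2(3-1)^2 = 8$ and 9 is divisible by 3. The hypercube implementation discussed next replaces each of these list manipulations by a routing operation.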

Before describing the COLUMNSORT algorithm, which is the hypercube
implementation of the columnsort algorithm, we show how to efficiently execute
step 2 on the hypercube. We assume that each processor has $p^{\epsilon}$ numbers
between 0 and $p^{O(1)}$ for some constant $\epsilon > 0$ and that $p^{\epsilon} = 2^j$ for some integer $j$.
Let $a$ be such that $p = 2^{3a+b}$, $0 \le b \le 2$. We can view the $p$ processors as $2^b$
cubes each of size $2^a \times 2^a \times 2^a$. Each processor in cube $l$ ($0 \le l \le 3$) can be
indexed as $PE(k_1, l, k_2, k_3)$, where $0 \le k_1, k_2, k_3 \le 2^a - 1$. The implementation
of step 2 can be done as follows and is illustrated in Figure 2.8.

[s1] For each $k_1$, $l$ and $k_2$, let $PE(k_1, l, *, *)$ be a $k_1$-block and
$PE(*, l, k_2, *)$ be a $k_2$-block of $2^a \times 2^a$ processors. We can
consider each block as a matrix of size $2^a \times 2^a$. Transpose each
$k_2$-block matrix.

[s2] Transpose each $k_1$-block matrix.

[s3] If there is more than 1 cube, shuffle the $k_2$-blocks of cube 0
with those of cube 1. If there are more than 2 cubes, shuffle the
$k_2$-blocks of cube 2 with those of cube 3.

[s4] If there are more than 2 cubes, then we consider 2 consecutive
$k_2$-blocks as a $k_2$-block and shuffle $k_2$-blocks of cube 0 and 1
with $k_2$-blocks of cube 2 and 3.

[s5] For each $l$, $k_2$ and $k_3$, perform a consecutive-to-cyclic conversion
of elements in the $2^a$ processors of $PE(*, l, k_2, k_3)$.

Figure 2.7: The step by step implementation of the columnsort

Figure 2.8: The hypercube implementation of step 2 of the columnsort. Steps
[s3]-[s5] are omitted

Using the facts outlined in the previous section, it is clear that the time
complexity of the above algorithm is $O(p^{\epsilon})$ on the pipelined hypercube.

If  e  >  2,  sorting  each  “column”  can  be  done  directly  in  each  processor. 
Otherwise,  sorting  in  step  1,  3,  5  and  7  can  be  done  by  a  recursive  application 
of  the  COLUMNSORT  algorithm  on  pa  processors.  The  recursive  application 
ends  when  the  number  of  remaining  processors  is  less  than  or  equal  to  pa  and 
the  recursion  depth  is  at  most  r*‘iogl‘1~  Step  4  is  the  inverse  of  step  2  and 
has the same time complexity. The permutations in steps 6 and 8 can easily be
implemented and their time complexities are $O(p^{\epsilon})$. Thus, the total time $T(p, n)$
for COLUMNSORT is given by

T(p,p‘+‘)  =  4r(pT,pH')  +  (9(p'). 

Theorem 2.6 The algorithm columnsort can be implemented on the pipelined
hypercube to sort $n = \Omega(p^{1+\epsilon})$ numbers between 0 and $p^{O(1)}$ in time $O(\frac{n}{p})$ for
any positive constant $\epsilon$. □

Corollary 2.3 The algorithm columnsort can be implemented on the weak
hypercube, the shuffle-exchange, the cube-connected cycles and the butterfly to sort
$n = \Omega(p^{1+\epsilon})$ numbers between 0 and $p^{O(1)}$ in time $O(\frac{n}{p}\log p)$ for any positive
constant $\epsilon$. □

Notice  that  best  known  algorithm  of  integer  sorting  on  the  CRCW  PRAM 
runs  in  time  Q("— for  1  <  p  <  ” [24].  This  PRAM  algorithm  is  not 
efficient  even  on  CRCW  PRAM  while  our  algorithm  is  efficient  on  the  pipelined 
hypercube. 


2.5  Routing 

The most general version of the routing problem can be phrased as follows [58].
The $(n, k_1, k_2)$ routing problem is a set of $n$ packets, each of which is specified by
a source and a destination, such that no processor appears as a source
(respectively destination) in more than $k_1$ (respectively $k_2$) packets. We are assuming
that these packets reside on the $p$ processors, where $p \le n$. The problem is to
route these requests simultaneously. The best known solution to this problem
is  a  Q{ki  ^'2  +  ” * " )  algorithm  on  bounded-degree  fixed  connection  network. 
The construction of such networks is highly nontrivial and depends on networks
enabling sorting in $O(\log n)$ steps and on expander graphs. In this section,
we will present a solution that matches this bound on the pipelined hypercube
for all $p \le n$. For the case when $n = p^{1+\epsilon}$, for any positive constant $\epsilon$, our
algorithm will run in time $O(k_1 + k_2 + \frac{n}{p})$. Optimal results on the weak
hypercube, the shuffle-exchange, and the cube-connected cycles are also presented
($k_1 = O(\frac{n}{p}) = k_2$).

The  algorithm  presented  in  [58]  depends  crucially  on  the  solutions  of  load 
balancing  and  sorting.  Our  algorithms  for  the  general  routing  problem  are 
similar  to  that  of  [58]  but  we  provide  new  solutions  to  the  above  two  subproblems 
on  our  networks.  The  load  balancing  and  the  sorting  problem  were  considered 
in  sections  2.3  and  2.4,  respectively. 

The $(n, k_1, k_2)$ routing problem can be handled by a fully-normal algorithm
combined with load balancing to obtain the following lemma.

Lemma 2.9 The $(n, k_1, k_2)$ routing problem can be solved in time
$O(k_1 + k_2 \log p + \log^2 p)$ on the pipelined hypercube by using a fully-normal
algorithm combined with the load balancing algorithm.

Proof: The algorithm consists of log p stages. During stage i, $i = \log p - 1, \ldots, 0$,
the packets in $P_j$ with bit i of their destination labels different from
that of j are moved along dimension i. We then apply the load balancing
algorithm to the subcubes determined by the dimensions $0, 1, \ldots, i-1$. Notice
that no processor will ever have more than $2k_2$ packets. □

Corollary 2.4 The $(n, k_1, k_2)$ routing problem can be solved in time
$O(k_1 + k_2 \log p + \log^2 p)$ on the pipelined hypercube. □
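
The stage structure used in the proof of Lemma 2.9 can be illustrated by a small sequential simulation. This is only a sketch under stated assumptions: the function name `route_by_dimension` and the representation of a packet as a (source, destination) pair are hypothetical, and the load balancing step applied after each stage in the proof is omitted. The sketch therefore shows only that processing the dimensions log p - 1, ..., 0 in turn delivers every packet, and says nothing about the time bound.

```python
def route_by_dimension(p, packets):
    """Simulate dimension-by-dimension routing on a p-node hypercube.

    packets: list of (source, destination) pairs with labels in range(p).
    Returns a dict mapping each processor to the destination labels of
    the packets it holds after the last stage.
    Assumption: the load balancing of the proof is left out.
    """
    assert p > 0 and p & (p - 1) == 0, "p must be a power of 2"
    dims = p.bit_length() - 1
    # held[j] lists the destinations of the packets currently in P_j
    held = {j: [] for j in range(p)}
    for src, dst in packets:
        held[src].append(dst)
    for i in range(dims - 1, -1, -1):        # stages i = log p - 1, ..., 0
        nxt = {j: [] for j in range(p)}
        for j in range(p):
            for dst in held[j]:
                if (dst >> i) & 1 != (j >> i) & 1:
                    nxt[j ^ (1 << i)].append(dst)   # move along dimension i
                else:
                    nxt[j].append(dst)              # bit i already agrees
        held = nxt
    return held
```

After stage i every packet agrees with its current processor on bits log p - 1 down to i, so after the last stage each packet resides at its destination.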

Actually the merge sort algorithm of the previous section, whose time complexity
is 0(  +  log^  p), can be used to obtain a better solution whenever $k_2$
is large.

Theorem  2.7  The  {n,ki,k2)  routing  problem  can  be  solved  on  the  pipelined 
hypeicube  in  time  0{k\  +  k2  +  sJspi  .p  log^p). 

Proof: The overall strategy is similar to that of [58]. It consists of (1) load
balancing, (2) sorting by destination labels, (3) counting the number of packets
that have to be sent to each processor, and then (4) executing steps similar to
[B2] and [B3] of the BALANCE algorithm. Using the time bounds of the load
balancing and the sorting algorithms, the proof of the theorem follows. □

We  can  do  even  better  in  the  case  when  n  is  much  larger  than  p  by  using 
algorithm  COLUMNSORT  of  the  previous  section  or  the  cubesort  algorithm 
of  Cypher  and  Sanz  [18]. 

The  cubesort  algorithm  sorts  n  =  p^'^T  elements  i.i  time  on  the 

shuffle-exchange  and  in  time  0{P  — on  the  weak  hypercube,  the  cube- 
connected  cycles  or  the  butterfly,  where  /  is  an  arbitrary  integer  greater  than  2. 
It  consists  of  0{l^)  stages,  and  each  stage  sorts  groups  containing  ^  elements 
and  performs  0{l)  shuffles  or  unshuffles.  This  algorithm  can  be  easily  modified 
to  sort  integers  between  0  and  tjjg  pipelined  hypercube.  Since  each 

group is contained in a processor, it can be sorted in time $O(l \frac{n}{p})$ by using the
radix sort algorithm. And by Lemma 2.4, the shuffle or unshuffle operation can
be implemented in time $O(\frac{n}{p} + \log p)$. Thus we have proved the following lemma.

Lemma  2.10  The  cubesort  algorithm  can  be  implemented  on  the  pipelined  hy¬ 
percube  to  sort  n  =  p^'^T  integers  between  0  and  p^B)  iji  time  0{P^),  where  I  is 
any  positive  integer  greater  than  2.  □ 

Notice  that  when  /  is  a  constant,  the  cubesort  algorithm  can  be  implemented 
in  time(9(^),  and  when  T  =  o(logn),  the  algorithm  is  faster  than  the  0{  2-!2£ii 

log^  p)  merge  sort  algorithm  of  the  previous  section. 

Corollary  2.5  The  {n,ki,k2)  routing  problem  can  b^  solved  on  the  pipelined 
hypercube  in  time  0(ki  -|-  k2  -f  when  n  =  p’'*'^  for  some  integer  /  >  2.  □ 

We now turn our attention to the case of the weak hypercube. When $n = p^{1+1/l}$,
for a fixed positive constant l, the integer sorting and the $(n, k_1, k_2)$ routing
problem can be solved in time $O(\frac{n}{p} \log p)$ and $O((k_1 + k_2) \log p)$ respectively on the
weak hypercube, since each step on the pipelined hypercube can be simulated
with  logp  steps  on  the  weak  hypercube.  When  /  is  not  fixed,  the  routing  problem 
can  be  soK'ed  in  time  0((^'i  +  k2)  logp+Z^"*”^”)  by  using  the  cubesort  algorithm 
and  a  straightforward  load  balancing  algorithm.  However,  the  following  lemma 
established  by  Plaxton  [61]  for  sorting  on  the  weak  hypercube  and  the  load 
balancing  algorithm  of  Lemma  2.7  can  be  used  to  obtain  a  faster  algorithm. 

Lemma  2.11  Let  q  =  log^^^  p  log  log  p.  Then,  if  n  <  pq.  the  n  elements  can  be 
sorted  in  time 

T{n.p)  =  0(-log(7r/p)  +  -log‘'^p  +  log^plog(r7/p)), 

P  P 

and  if  n  >  pq,  they  can  be  sorted  in  time 

T{n,p)  =  0(-logp(- — +  log^plog(n/p)), 
p  log(n/(p9)) 

on  the  weak  hypercube.  □ 

Corollary  2.6  The  {n.ki,k2)  routing  problem  can  be  solved  on  the  weak  hyper¬ 
cube  in  time  0{{ki  +^’2)  log''^^p-hr(n,p)),  where  T[n,p)  is  as  defined  in  Lemma 
2.11.  □ 

The  facts  that  sorting  can  be  done  in  timeO(/”'"^-)  (respectively 
on  the  shuffle-exchange  (the  cube-connected  cycles  or  the  butterfly)  when  n  = 
for  some  I  >  2,  can  be  used  to  show  the  following. 

Corollary  2.7  The  {n,ki,k2)  routing  problem  can  be  solved  on  the  shuffle- 
exchange  or  the  cube-connected  cycles  in  time  0{{ki  -f  A’2)logp-(-  I  — ” )  or 

0{{ki  +  ki)  logp  -f-  /22J2&IL)  u'hen  n  =  for  some  /  >  2.  □ 

Notice that the above upper bounds for the shuffle-exchange, the cube-connected
cycles and the butterfly are optimal whenever $k_1 = O(\frac{n}{p})$, $k_2 = O(\frac{n}{p})$
and l is a fixed constant.


2.6  Relationship  With  The  CRCW  Model 

Several powerful techniques have been developed for designing efficient parallel
algorithms for the PRAM model, and it seems that this model is ideal for
discovering inherent parallelism and for writing parallel algorithms. Therefore it
is important to develop an efficient step by step simulation of a PRAM
algorithm on our networks. Several simulations from the PRAM model onto these
networks are known. However there is always some loss of efficiency incurred by
these simulations. The best known result emulates each step of a PRAM
algorithm using p processors in time $O((\log p \log M)/\log \log p)$, where M is the size
of the memory used by the PRAM algorithm [29]. This bound can be reduced
to $O(\log p \log \log p)$ if $M = p(\log p)^{O(1)}$ [28]. The bounded-degree networks used
are based on expander graphs that can sort optimally. Below we derive a simple
simulation result for the pipelined hypercube model.

Lemma 2.12 Let A be a PRAM algorithm whose running time is T(n) using
a total of W(n) operations and a memory of size M(n) = O(n), where n is the
input size. Then A can be simulated on a p-processor pipelined hypercube in
$O(\frac{W(n)}{p} + \frac{n}{p} T(n) + T(n) \log^2 p)$ time.

Proof: Decompose the memory into p equal size blocks $B_i$, $0 \le i \le p-1$.
The memory $M_i$ of processor $P_i$ of the hypercube will hold the data in $B_i$. Let
$W_j(n)$ be the number of operations used in step j, $1 \le j \le T(n)$, of the PRAM
algorithm. Without loss of generality, assume that this step involves a read
operation. The other cases are similar. Each processor $P_i$ of the hypercube will
handle $\frac{W_j(n)}{p}$ of these operations. This can be handled by a routing algorithm
where each processor sets up $\frac{W_j(n)}{p}$ packets with the destinations assigned as
determined by the initial memory map. Hence each processor is the source of
$\frac{W_j(n)}{p}$ packets but could be the destination of $\frac{M(n)}{p} = O(\frac{n}{p})$ packets. This
can be done in $O(\frac{W_j(n)}{p} + \frac{n}{p} + \log^2 p)$ time. Therefore the overall simulation
requires $O(\sum_{j=1}^{T(n)} (\frac{W_j(n)}{p} + \frac{n}{p} + \log^2 p)) = O(\frac{W(n)}{p} + \frac{n}{p} T(n) + T(n) \log^2 p)$.

□

For example, for algorithms with W(n) = M(n) = O(n) and T(n) =
O(log n), the simulation bound reduces to $O(\frac{n \log n}{p} + \log n \log^2 p)$. Note that if
$M(n) \ne O(n)$, A can be simulated in time $O(\frac{W(n)}{p} + \frac{M(n)}{p} T(n) + T(n) \log^2 p)$
on the pipelined hypercube. Note also that the above simulation result can be
slightly improved if $n = \Omega(p^{1+\epsilon})$, for some fixed $\epsilon > 0$. Using a routing algorithm
from the previous section, the bound of the above lemma can be improved to
$O(\frac{W(n)}{p} + \frac{n}{p} T(n))$. By the same reasoning, the PRAM model can be simulated
on the weak hypercube, the shuffle-exchange, the cube-connected cycles and the
butterfly as follows.

Corollary 2.8 Let A be a PRAM algorithm whose running time is T(n) using
a total of W(n) operations and a memory of size M(n) = O(n), where n is the
input  size.  Then  A  can  be  simulated  in  time  0{  "gp  4.  on  a 

p-processor  weak  hypercube,  the  shuffle-exchange,  the  cube-connected  cycles  and 
the  butterfly.  □ 


Chapter  3 


Almost  Uniformly  Optimal 
Algorithms 

3.1  Introduction 

Suppose we are given a network N with p processors. An algorithm to solve a
given problem of size n > p will be called almost uniformly optimal if the running
time of the algorithm is provably the best possible for all $p < n/\log^k n$, for some
fixed constant k. Such algorithms have been developed for many problems on
the PRAM model. However, except for very few cases, no such algorithms are
known on the network model.

We address in this chapter several problems that can be solved by almost
uniformly optimal algorithms on our networks. These problems are the all nearest
smaller values (ANSV) problem, and some problems in computational geometry
and in VLSI routing. All these problems can be solved efficiently in
O(log n) time on the CREW PRAM (and even faster on the CRCW PRAM).
The PRAM algorithms can be directly simulated in time $O(\frac{n \log n}{p} + \log n \log^2 p)$
on a p-processor pipelined hypercube. We provide faster algorithms to handle
these problems. We also provide lower bound proofs of the problems on the weak
hypercube, the shuffle-exchange, the cube-connected cycles and the butterfly.

The rest of the chapter is organized as follows. The ANSV and related
problems  are  considered  in  section  3.2.  The  algorithms  for  a  couple  of  basic 
problems  in  VLSI  routing  are  presented  in  section  3.3.  The  last  section  is  devoted 
to  the  lower  bound  proofs. 

3.2  ANSV  and  Related  Problems 

The all nearest smaller values (ANSV) problem can be defined as follows. The
input consists of an array $A = (a_0, a_1, \ldots, a_{n-1})$, where the $a_i$'s come from a
totally ordered set. The output is an array B such that $B(i) = (a_{l(i)}, a_{r(i)})$,
where $a_{l(i)}$ and $a_{r(i)}$ are respectively the nearest elements to the left and to the
right of $a_i$ that are less than $a_i$, if they exist. If one or both of them do not exist,
this can be indicated with a special symbol. We call $a_{l(i)}$, whenever it exists,
the left match of $a_i$. We similarly call $a_{r(i)}$, whenever it exists, the right match
of $a_i$.

The ANSV problem was introduced in [7]. It turns out that merging is a special
case of ANSV. This problem can be solved sequentially in linear time by using
a stack. An optimal CRCW PRAM algorithm with running time $O(\log \log n)$
was shown in [7]. This algorithm was then used to solve the monotone polygon
triangulation, the binary tree reconstruction, and parenthesis matching within
the same bounds.
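
The sequential stack-based solution mentioned above can be sketched as follows. This is a minimal illustration, not the formulation of [7]: the function name `ansv_sequential` is hypothetical, matches are returned as indices rather than values, `None` plays the role of the special symbol, and the elements are assumed to be distinct.

```python
def ansv_sequential(a):
    """Return (left, right), where left[i] and right[i] are the indices of
    the nearest smaller elements to the left and right of a[i], or None.

    Assumption: the elements of a are distinct.
    """
    n = len(a)
    left = [None] * n
    right = [None] * n
    stack = []  # indices whose right match is still unknown; values increase
    for i, x in enumerate(a):
        # x is the first smaller element to the right of every larger
        # element still waiting on the stack
        while stack and a[stack[-1]] > x:
            right[stack.pop()] = i
        # the element now on top (if any) is the nearest smaller to the left
        if stack:
            left[i] = stack[-1]
        stack.append(i)
    return left, right
```

Each index is pushed and popped at most once, which gives the linear running time.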

We will present an almost uniformly optimal algorithm to handle the ANSV
problem on the pipelined hypercube model. We will later see that its implementation
on the shuffle-exchange, the cube-connected cycles or the butterfly is also
almost uniformly optimal.

Assume from now on that all elements in A are distinct and that n and p
are both powers of 2. The other cases are treated in a straightforward way. We
start with a simple divide-and-conquer algorithm which is fast in the case when

iTihr  <  p  ^ 

procedure Simple ANSV

Input: An array $A = (a_0, \ldots, a_{n-1})$ stored consecutively on a p-processor
hypercube.

Output: An array $B = (b_0, \ldots, b_{n-1})$ such that $b_i = (a_{l(i)}, a_{r(i)})$, where $a_{l(i)}$ and
$a_{r(i)}$ are respectively the left and the right matches of $a_i$.

1. If p = 1, then use an optimal serial algorithm to solve the ANSV problem.

2. Recursively, solve separately each of the ANSV problems corresponding
to the two subarrays $A_0 = (a_0, a_1, \ldots, a_{n/2-1})$ and
$A_1 = (a_{n/2}, a_{n/2+1}, \ldots, a_{n-1})$ stored in the low and high subcubes.

3. Let $A'_0 = (a_{i_1}, \ldots, a_{i_s})$ be all the elements of $A_0$ that do not have their
right matches in $A_0$, and let $A'_1 = (a_{j_1}, \ldots, a_{j_t})$ be all the elements of $A_1$
that do not have their left matches in $A_1$. Then $a_{i_1} < a_{i_2} < \ldots < a_{i_s}$ and
$a_{j_1} > a_{j_2} > \ldots > a_{j_t}$. Find the right matches of the elements of $A'_0$ and
the left matches of the elements of $A'_1$ by merging the corresponding two
sequences.

Lemma 3.1 The right match of each element in $A'_0$ is in $A'_1$, if it exists.
Similarly, the left match of each element of $A'_1$ is in $A'_0$, if it exists.

Proof: Let a be an element of $A_0$ whose right match b is in $A_1$. Then clearly a
is in $A'_0$. Suppose that b is not in $A'_1$. This implies that b has a left match c in
$A_1$. Hence c < b and c appears before b in the array A. But this means that the
right match of a cannot be b. Therefore b does not have a left match in $A_1$. □

Theorem 3.1 Algorithm Simple ANSV correctly solves the ANSV problem in
time $O(\frac{n \log p}{p} + \log^2 p)$ on a p-processor pipelined hypercube.

Proof: The correctness proof is simple and can be established by induction on
p. As for the running time, we note that the merging required at step 3 can be
done in $O(\frac{n}{p} + \log p)$ on the pipelined hypercube model (section 2.4). Since the
recursion has log p levels, the time bound follows. □
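
The divide-and-conquer structure of Simple ANSV can be sketched sequentially as follows. This is an illustration only: the function name `simple_ansv` is hypothetical, the recursion bottoms out at single elements instead of calling a serial algorithm at p = 1, matches are returned as indices, and the merging of step 3 is written as two linear scans over the sequences $A'_0$ and $A'_1$ (increasing and decreasing, respectively). Distinct elements are assumed.

```python
def simple_ansv(a, lo=0, hi=None, left=None, right=None):
    """Divide-and-conquer ANSV on a[lo:hi]; returns (left, right) index lists.

    Assumption: the elements of a are distinct.
    """
    if hi is None:
        hi = len(a)
        left = [None] * hi
        right = [None] * hi
    if hi - lo <= 1:
        return left, right
    mid = (lo + hi) // 2
    simple_ansv(a, lo, mid, left, right)   # subarray A_0
    simple_ansv(a, mid, hi, left, right)   # subarray A_1
    # A'_0: elements of A_0 without a right match in A_0 (increasing values)
    a0 = [i for i in range(lo, mid) if right[i] is None]
    # A'_1: elements of A_1 without a left match in A_1 (decreasing values)
    a1 = [j for j in range(mid, hi) if left[j] is None]
    # right matches of A'_0: first element of A'_1 that is smaller
    ptr = 0
    for i in reversed(a0):                 # values of A'_0 in decreasing order
        while ptr < len(a1) and a[a1[ptr]] > a[i]:
            ptr += 1
        if ptr < len(a1):
            right[i] = a1[ptr]
    # left matches of A'_1: last element of A'_0 that is smaller
    q = len(a0) - 1
    for j in a1:                           # values of A'_1 in decreasing order
        while q >= 0 and a[a0[q]] > a[j]:
            q -= 1
        if q >= 0:
            left[j] = a0[q]
    return left, right
```

Both scans move their pointers monotonically, so the combine step takes time linear in the sizes of $A'_0$ and $A'_1$, which is the role played by merging in step 3.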

We  will  develop  a  faster  algorithm  in  the  rest  of  this  section.  We  start 
by  stating  a  simple  fact  (also  stated  in  [7])  that  will  be  needed  to  justify  the 
algorithm. 

Lemma 3.2 [7] Let $1 \le j \le n$ be an arbitrary index and let $I[j] = \{j+1, \ldots, r(j)-1\}$
whenever r(j) exists. Then the following statements hold.

(1) If $k \in I[j]$, then $r(k), l(k) \in I[j] \cup \{j, r(j)\}$.

(2) If $k \notin I[j]$, then $r(k), l(k) \notin I[j]$.

Similar statements hold for $I'[j] = \{l(j)+1, \ldots, j-1\}$. □

Partition A consecutively into p blocks, say $A_0, A_1, \ldots, A_{p-1}$, i.e., $A_i =
(a_{in/p}, a_{in/p+1}, \ldots, a_{(i+1)n/p-1})$, $0 \le i \le p-1$. Let m(i) be the index of the minimum
element in $A_i$. Define the reduced array A' to be
$A' = (a_{m(0)}, a_{m(1)}, \ldots, a_{m(p-1)})$. For each block $A_i$, all the elements in $A_i$ appearing
before $a_{m(i)}$ have their right matches in $A_i$, and all elements appearing
after $a_{m(i)}$ have their left matches in $A_i$. We now state the following fact.

Lemma 3.3 Let $A_i$ be an arbitrary block such that the right match of $a_{m(i)}$
belongs to block $A_{i'}$, $i' \ne i$. Then there exists a unique k, $i < k < i'$, such
that the left match of $a_{m(k)}$ is in $A_i$ and the right match of $a_{m(k)}$ is in $A_{i'}$. □

Let $A_i$ be an arbitrary block such that the right match of $a_{m(i)}$ is in block $A_{i'}$.
With each $A_i$ we associate two subsequences $(S_{i,0}, S_{i,1})$, where $S_{i,0}$ is a segment
of $A_i$ and $S_{i,1}$ is a segment of $A_{i'}$ defined as follows:

1) If $i' = i + 1$, then $S_{i,0} = (a_{m(i)}, \ldots, a_{(i+1)n/p-1})$ and $S_{i,1} = (a_{(i+1)n/p}, \ldots, a_{r(m(i))})$.
See Figure 3.1(a).

2) If $i' > i + 1$, then let k be the index whose existence is mentioned in the
above lemma. Then $S_{i,0} = (a_{m(i)}, \ldots, a_{l(m(k))})$ and $S_{i,1} = (a_{r(m(k))}, \ldots, a_{r(m(i))})$.
See Figure 3.1(b).

Figure 3.1: Subsequences corresponding to (a) $i' = i + 1$ and (b) $i' > i + 1$

Let $S'_{i,0}$ ($S'_{i,1}$) be the set of all the elements in $S_{i,0}$ ($S_{i,1}$) that do not have
their right (left) matches in $A_i$ ($A_{i'}$). By Lemma 3.2, it is clear that the right
matches of $S'_{i,0}$ are in $S'_{i,1}$ and the left matches of $S'_{i,1}$ are in $S'_{i,0}$, if they exist.
Moreover the elements of $S'_{i,0}$ are in increasing order while the elements of $S'_{i,1}$
are in decreasing order. Hence we can find the right matches of $S'_{i,0}$ and the left
matches of $S'_{i,1}$ by merging them.

We can similarly consider each block $A_i$ such that the left match of $a_{m(i)}$
exists. We can then introduce an index similar to k and the corresponding
subsequences. Again the problem comes down to a set of disjoint merging problems.

It turns out that the left and the right matches of all the elements can be
obtained by determining the left and the right matches within each block, and
by merging the pairs of subsequences arising by considering the right matches
of the $a_{m(i)}$'s and then merging the subsequences arising from the left matches of
the $a_{m(i)}$'s.

We are ready to describe our algorithm.

procedure ANSV

Input: An array $A = (a_0, a_1, \ldots, a_{n-1})$ of n distinct elements.

Output: The array $B = (b_0, b_1, \ldots, b_{n-1})$ such that $b_i = (a_{l(i)}, a_{r(i)})$, where $a_{l(i)}$
and $a_{r(i)}$ are respectively the left and the right matches of $a_i$.

1. Let $A_i = (a_{in/p}, \ldots, a_{(i+1)n/p-1})$ be the array stored in $P_i$, $0 \le i \le p-1$.
Solve the ANSV problem corresponding to each subarray.

2. Find the minimum element $a_{m(i)}$ in each subarray $A_i$.

3. Solve the ANSV problem corresponding to the reduced array A' on p
processors.

4. For each subarray $A_i$, if the right match of $a_{m(i)}$ is in block $A_{i+1}$, then
move subarray $A_{i+1}$ to $P_i$. Merge the corresponding subsequences within
each processor. Move the left matches found for the subsequence in $A_{i+1}$
to $P_{i+1}$.

5. For each subarray $A_i$ such that the right match of $a_{m(i)}$ is in block $A_{i'}$
with $i' > i + 1$, determine the index k described in Lemma 3.3. Move $a_{m(k)}$
and $A_{i'}$ to processor $P_i$. Merge the corresponding subsequences within
each processor. Move the left matches found for the subsequence in $A_{i'}$ to
$P_{i'}$.

6. Repeat steps 4 and 5 for the left pairs of subsequences.

Theorem 3.2 Algorithm ANSV correctly finds all the right and the left matches
of the array A of n elements. It can be implemented on a p-processor pipelined
hypercube to run in time $O(\frac{n}{p} + \log^2 p)$.

Proof: The correctness proof follows essentially from [7]. We will now establish
the stated time bound.

Steps 1 and 2 can be obviously done in $O(\frac{n}{p})$. Step 3 can be done in $O(\log^2 p)$
time by using Simple ANSV. Step 4 can be implemented as follows. Each
processor $P_i$ determines whether or not the right match of $a_{m(i)}$ found in step 3 is
the element $a_{m(i+1)}$. If this is the case, it sets up a request for the subarray $A_{i+1}$.
Since each $A_i$ is of size $\frac{n}{p}$, the corresponding data movement can be accomplished
in $O(\frac{n}{p} + \log p)$ time.

As for step 5, each processor, say $P_k$, sets up a request for the right match
of $a_{m(i)}$, where i is the index of the subarray containing the left match of $a_{m(k)}$.
These requests can be satisfied in $O(\log^2 p)$ time. Then $P_k$ checks to see if the
right match of $a_{m(i)}$ is in the same subarray as that of $a_{m(k)}$. If this is the case,
$a_{m(k)}$ is sent to $P_i$ in the next step. Let $\alpha : \{0, 1, \ldots, p-1\} \to \{0, 1, \ldots, p-1\}$
be the partial function $\alpha(i) = i'$, where i' is the index of the subarray containing
the right match of $a_{m(i)}$. We can now apply Theorem 2.1 to perform the data
movement required for step 5. Merging within each processor can be easily done
in $O(\frac{n}{p})$ time. The time required for the data movement is $O(\frac{n}{p} + \log^2 p)$. The
only thing left is to send back the left matches of corresponding elements in $A_{i'}$
from $P_i$ to $P_{i'}$. This can also be done by similar steps as in Theorem 2.1 and the
collect operation.

The  analysis  of  the  time  required  to  execute  step  6  is  similar  and  hence  the 
theorem  follows.  □ 

.Note  that  the  above  performance  cannot  be  improved  for  p  <  even  on 
the  CRCW  PRAM  model.  " 

Corollary 3.1 The ANSV problem can be solved in 0{  -p  log'^p) time on
a p-processor weak hypercube with a normal algorithm. Hence it can be solved
within the same time bound on a p-processor butterfly, shuffle-exchange, and
cube-connected-cycles. □

3.2.1  The  Parentheses  Matching  Problem 

Let $A = (a_0, a_1, \ldots, a_{n-1})$ be a legal sequence of parentheses, where each $a_i$ = '('
or ')'. The parentheses matching problem is to determine for each $a_i$ the index j
such that $a_j$ is the match of $a_i$. It is well-known that this problem can be solved
as follows. Compute the nesting levels of the parentheses using prefix sums, and
then apply ANSV to find matching parentheses, which will have the same level
of nesting while the level of nesting of any parenthesis between them is higher.

Using our ANSV algorithm, we obtain the following corollary.

Corollary 3.2 Given a legal sequence of left and right parentheses, the
parentheses matching problem can be solved within the same time bound as that of
ANSV on any of the parallel models considered in this thesis. □
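
The reduction from parentheses matching to nearest smaller values can be sketched sequentially as follows. The encoding used here is an assumption made for illustration and is not taken from the text: a left parenthesis at nesting level d receives the key 2d and a right parenthesis at level d receives the key 2d - 1, so that the nearest smaller key to the right of a left parenthesis belongs to its match. The function name `match_parentheses` is hypothetical, and the nearest smaller values are found with a sequential stack rather than with the parallel ANSV algorithm.

```python
def match_parentheses(s):
    """Return match, where match[i] is the index of the partner of s[i].

    Assumptions: s is a legal sequence of '(' and ')'; the key encoding
    (2d for '(' and 2d - 1 for ')') is chosen for this sketch only.
    """
    # nesting levels via a running (prefix) sum
    depth = 0
    keys = []
    for c in s:
        if c == '(':
            depth += 1
            keys.append(2 * depth)
        else:
            keys.append(2 * depth - 1)
            depth -= 1
    # nearest strictly smaller key to the right of every position
    right = [None] * len(s)
    stack = []
    for i, k in enumerate(keys):
        while stack and keys[stack[-1]] > k:
            right[stack.pop()] = i
        stack.append(i)
    # the right match of each '(' is its closing parenthesis
    match = [None] * len(s)
    for i, c in enumerate(s):
        if c == '(':
            j = right[i]
            match[i] = j
            match[j] = i
    return match
```

Every parenthesis strictly between a matching pair at level d has level at least d + 1 and therefore a key of at least 2d + 1, which is why the first smaller key to the right of a left parenthesis is that of its partner.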


3.2.2  Triangulating  a  Monotone  Polygon 

A polygonal chain $Q = (q_0, \ldots, q_{n-1})$ is said to be monotone if the vertices
$q_0, \ldots, q_{n-1}$ are in increasing (or decreasing) order by their x-coordinates. A
monotone polygon consists of an upper monotone polygonal chain and a lower
monotone polygonal chain. The goal is to determine a set of edges, each edge
connecting a pair of vertices, which will triangulate the input polygon.

A known strategy to solve the problem of triangulating a monotone polygon
is the following. Let a one-sided monotone polygon be a monotone polygon
whose upper or lower chain is a straight line. A solution strategy consists of (i)
decomposing the input polygon into one-sided monotone polygons by merging
the vertices of the lower and the upper chains and then (ii) using ANSV to
determine additional edges needed to triangulate the input polygon. We thus have
the following corollary.

Corollary  3.3  A  monotone  polygon  can  be  triangulated  within  the  same  time 
bound  as  that  of  solving  the  ANSV  problem  on  any  of  the  parallel  models  con¬ 
sidered  in  this  thesis.  □ 

3.2.3  The  All  Nearest  Neighbor  Problem 

Let $Q = (q_0, \ldots, q_{n-1})$ be a convex polygon, where $(q_i, q_{i+1})$ is an edge of Q,
$0 \le i \le n-2$. The all nearest neighbor problem for Q is to determine, for each
vertex $q_i$, a vertex $q_j$, $i \ne j$, such that the Euclidean distance between $q_i$ and $q_j$ is
minimal. This problem can be solved optimally by using essentially merging [69].
Hence the corresponding algorithm runs in time $O(\frac{n}{p} + \log p)$ on the p-processor
pipelined hypercube and in time $O(\frac{n \log p}{p} + \log^2 p)$ on the p-processor weak
hypercube, butterfly, shuffle-exchange, or cube-connected-cycles.


3.3  VLSI  Routing 

In this section we introduce two basic problems whose solutions can be used to
solve several VLSI routing problems. We start with the first problem which is
useful for handling river (one-layer) routing problems [10]. The input consists of
two arrays $B = (b_0, b_1, \ldots, b_{n-1})$ and $T = (t_0, t_1, \ldots, t_{n-1})$, where $b_0 < b_1 < \ldots <
b_{n-1}$ and $t_0 < t_1 < \ldots < t_{n-1}$ such that $b_j < b_{j+1} < t_j$, for all $0 \le j \le n-2$.
The output is the array $S = (t_{j(0)}, t_{j(1)}, \ldots, t_{j(n-1)})$, where $j(i) = \min_j \{j \le
i \mid t_j + i - j - 1 > b_i\}$. If we view B and T as representing the bottom and the top
terminals of an arbitrary instance of river routing, then all the bendpoints of the
detailed routing can be deduced from S. Moreover, the minimum separation is
given by $\max_i \{i - j(i) + 1\} + 1$.

A simple algorithm to handle this problem can be obtained as follows. Let
$t'_j = t_j - j - 1$ and $b'_i = b_i - i$. Then j(i) is given by $j(i) = \min_j \{j \le i \mid t'_j > b'_i\}$.
This can be done by merging $(b'_0, b'_1, \ldots, b'_{n-1})$ and $(t'_0, t'_1, \ldots, t'_{n-1})$ and then
determining for each $b'_i$ the nearest $t'_j$ to the right. Hence this problem can be
solved in $O(\frac{n}{p} + \log p)$ time on a p-processor pipelined hypercube, and in time
$O(\frac{n \log p}{p} + \log^2 p)$ on a p-processor butterfly, shuffle-exchange, cube-connected-cycles,
or weak hypercube. See section 5.2 for more details. We will later derive
the corresponding lower bound.

The second basic problem, called line packing, consists of packing a set of
n intervals $I_0, I_1, \ldots, I_{n-1}$ using the minimum possible number of tracks [19].
More precisely, our input is given as an array $A = (a_0, a_1, \ldots, a_{2n-1})$, where
$a_j = (x_j, id(j), mark(j))$ such that $x_j$ is the x-coordinate of a terminal (endpoint
of an interval), id(j) is the serial number of the corresponding interval,
and mark(j) indicates whether $a_j$ corresponds to the left endpoint or the right
endpoint of $I_{id(j)}$. Moreover we are assuming that the $a_j$'s are sorted by their
first components. The desired output is an array $B = (b_0, b_1, \ldots, b_{2n-1})$ such
that $b_j = a_{s(j)}$, where $a_{s(j)}$ corresponds to a left terminal that follows in the same
track the right terminal of $a_j$. If $a_j$ corresponds to a left terminal or no interval
comes after $I_{id(j)}$ in the same track, then $b_j$ is not defined. The number of tracks
should be minimized. It turns out that the minimum number of tracks is equal
to the density $d = \max_x \{d_x\}$, where $d_x$ is the number of intervals containing x.

The  solution  of  the  line  packing  problem  can  be  used  to  solve  the  channel 
routing  problem  in  the  two-layer  model,  where  each  column  contains  at  most 
one  terminal.  It  can  also  be  used  to  perform  optimal  routing  in  the  knock-knee 
model  [9].  The  algorithm  is  given  below. 

procedure Line Packing

Input: A sorted array $A = (a_0, a_1, \ldots, a_{2n-1})$ representing the endpoints of n
intervals. When two endpoints have an equal x-coordinate, the right endpoint
precedes the left endpoint in A.

Output: The array $B = (b_0, b_1, \ldots, b_{2n-1})$ as defined above.

1. Assign +1 to each left terminal and -1 to each right terminal, and compute
the prefix sums of all the terminals.

2. For each right terminal $a_j$ whose prefix sum value is u, find the nearest
left terminal $a_{s(j)}$ to the right of $a_j$ whose prefix sum value is greater than
u. Set $b_j = a_{s(j)}$ if such s(j) exists, and nil otherwise.
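
The two steps of procedure Line Packing can be sketched sequentially as follows. This is an illustration under stated assumptions: the function name `line_packing` and the encoding of mark(j) as the strings 'L' and 'R' are hypothetical, the output holds the index s(j) rather than the terminal $a_{s(j)}$, `None` stands for nil, and step 2 is carried out by a plain scan to the right instead of the ANSV algorithm, so the sketch is quadratic in the worst case.

```python
def line_packing(terminals):
    """terminals: list of (x, interval_id, mark) sorted by x, where mark is
    'L' or 'R' and a right endpoint precedes a left endpoint at equal x.

    Returns b, where b[j] is the index s(j) of the left terminal that
    follows right terminal j in the same track, or None.
    """
    # Step 1: +1 for left terminals, -1 for right terminals, prefix sums
    prefix = []
    total = 0
    for _, _, mark in terminals:
        total += 1 if mark == 'L' else -1
        prefix.append(total)
    # Step 2: nearest left terminal to the right with a larger prefix sum
    b = [None] * len(terminals)
    for j, (_, _, mark) in enumerate(terminals):
        if mark != 'R':
            continue
        u = prefix[j]
        for s in range(j + 1, len(terminals)):
            if terminals[s][2] == 'L' and prefix[s] > u:
                b[j] = s
                break
    return b
```

For the three intervals [1, 3], [2, 10] and [4, 5], the right endpoint of the first interval is linked to the left endpoint of the third, so two tracks are used, which equals the density of this instance.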


As an example, a channel routing instance, the corresponding sorted list and
prefix sums, and chains of intervals to be put in the same tracks, are shown in
Figure 3.2. This example is from [9]. The correctness proof of the algorithm
follows from [19]. Now, we have the following corollary.

Figure 3.2: (a) A channel routing instance, (b) corresponding sorted list and
prefix sums, (c) chains of intervals

Corollary 3.4 Given a set of intervals whose terminals are sorted as described
above, the line packing problem can be solved within the same time bound as that
of solving the ANSV problem on any of the parallel models considered in this
thesis. □

We  will  derive  the  corresponding  lower  bound  in  the  next  section. 


3.4  Lower  Bounds 

The  performance  of  all  the  algorithms  presented  has  degraded  by  a  factor  of  log  p 
in  the  transition  from  the  pipelined  hypercube  model  to  the  weak  hypercube 
model  or  to  any  of  the  related  bounded-degree  networks.  We  will  show  in  this 
section  that  these  upper  bounds  cannot  be  improved  on  the  bounded-degree 
networks  in  general  (i.e.  as  long  as  ^  >  log^p).  Before  presenting  the  proofs,  a 
couple  of  comments  concerning  the  general  network  model  are  in  order. 

For all the problems considered in this chapter, each of the input and the
output can be efficiently represented by an array of data items. Let $A =
(a_0, a_1, \ldots, a_{n-1})$ be such an input. The input memory map $\pi_{in} : \{0, 1, \ldots, n-1\}
\to \{0, 1, \ldots, p-1\}$ specifies the index mapping of the elements of A into the
local memories, say $\{M_0, M_1, \ldots, M_{p-1}\}$, of the p-processor network. For all the
algorithms presented, $\pi_{in}$ corresponds to the consecutive memory mapping, i.e.,
$\pi_{in}(j) = \lfloor jp/n \rfloor$ (assuming as usual that p divides n evenly). We can similarly
define the output memory map $\pi_{out} : \{0, 1, \ldots, n-1\} \to \{0, 1, \ldots, p-1\}$, where
$\pi_{out}(j)$ is the index of the local memory containing the jth data item of the
output array. Again all our algorithms generate an output stored in consecutive
order.

If we make the assumption that $\pi_{in}$ and $\pi_{out}$ correspond to consecutive
storage, then the lower bound of $\Omega(\frac{n \log p}{p})$ can be established for the weak hypercube
model and the bounded-degree networks by using the following simple technique.
For each of our problems, there exist instances which will require the exchange
of the data in the local memories of $\Omega(p)$ pairs of processors, each pair with a
Hamming distance of $\Omega(\log p)$. Since only O(p) data items can be communicated
during each unit of time, we obtain that $\Omega(\frac{n \log p}{p})$ time is needed to handle the
communication. However it is conceivable that a problem could become
significantly simpler if the input memory map somehow exploits the topology of the
network and matches it properly with the problem. Therefore we will establish
our lower bounds under the assumptions that $\pi_{in}$ and $\pi_{out}$ are arbitrary,
data-independent mappings such that, for each j, $0 \le j \le p-1$, $|\{i \mid \pi_{out}(i) = j\}| = \frac{n}{p}$
(i.e. the output array is evenly distributed among the local memories of the
different processors). Under these conditions we can also assume that the input
array is evenly distributed among the local memories of the processors, for
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otherwise  balancing  the  data  alone  will  require  n(  on  the  bounded-degree 

networks  [35]. 

The basic technique we use to establish all of our lower bounds is simple and
well-known. Our bounded-degree networks all have strong separators.

Since the separator of a graph is a communication bottleneck, if we can show
that $\Omega(n)$ data items have to be exchanged between the two partitions of
the processors induced by a separator, then the lower bound of
$\Omega(\frac{n \log p}{p})$ will immediately follow. We will show that this is
indeed the case for all our problems.
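
The step from data volume to running time can be written out as a one-line calculation. The sketch below assumes the standard fact, not restated in this section, that a $p$-processor butterfly, shuffle-exchange or cube-connected cycles can be split into two equal halves by cutting $O(p/\log p)$ links; the symbol $w$ for the number of cut links is introduced here only for illustration.

```latex
\[
T \;\ge\; \frac{\text{items that must cross the cut}}{\text{links in the cut}}
  \;=\; \frac{\Omega(n)}{w}
  \;=\; \frac{\Omega(n)}{O(p/\log p)}
  \;=\; \Omega\!\left(\frac{n \log p}{p}\right).
\]
```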

We begin by providing the lower bound for the merging problem. Let
$A = (a_0, a_1, \ldots, a_{n/2-1}, a_{n/2}, \ldots, a_{n-1})$ be an array such
that subarrays $A_0 = (a_0, \ldots, a_{n/2-1})$ and
$A_1 = (a_{n/2}, \ldots, a_{n-1})$ are sorted in nondecreasing order. Assume
for simplicity that $p$ divides $n$. Processor $P_i$, $0 \le i \le p-1$, has
$\frac{n}{p}$ elements of $A$ before and after merging.

Lemma 3.4 Let $PT_0$ and $PT_1$ be an arbitrary partition of the $p$ processors
such that $|PT_0| = |PT_1| = \frac{p}{2}$. Then the merging problem requires
$\Omega(n)$ data exchanges between $PT_0$ and $PT_1$.

Proof: Without loss of generality, assume that at least $\frac{n}{4}$ elements
of $A_0$ (the lower half of $A$) are in the local memories of $PT_1$. Since
$\pi_{out}$ is data-independent, the items generated in $PT_0$ are of fixed
ranks, say $1 \le r_0 < r_1 < \ldots < r_{n/2-1} \le n$. Note that the rank of
each element $a_i$ of $A_0$ is equal to $i + 1$ (its rank in $A_0$) plus its
rank in $A_1$. It is clear that $A$ can be chosen so that the rank of each
$a_i$, $0 \le i \le \frac{n}{2} - 1$, is exactly $r_i$. But in this case all
the elements of $A_0$ have to appear in $PT_0$. By our assumption, at least
$\frac{n}{4}$ of these elements are in $PT_1$. Therefore $\Omega(n)$ data items
have to be exchanged between $PT_0$ and $PT_1$. □

Corollary 3.5 The problem of merging two sorted sequences each of length $n$
requires $\Omega(\frac{n \log p}{p})$ time steps on a $p$-processor butterfly,
shuffle-exchange, or cube-connected-cycles, and $\Omega(\frac{n}{p})$ time
steps on the weak hypercube model with $p$ processors. □

We now introduce the following Restricted ANSV (RANSV) problem, which will be
used to establish the lower bounds for ANSV and for the problem of
triangulating a monotone polygon. The input consists of two arrays
$A = (a_0, a_1, \ldots, a_{n-1})$ and $B = (b_0, b_1, \ldots, b_{n-1})$, each
sorted in non-decreasing order. The output consists of two arrays
$A' = (b'_0, b'_1, \ldots, b'_{n-1})$ and $B' = (a'_0, a'_1, \ldots, a'_{n-1})$,
where $b'_i$ (respectively $a'_i$) is the largest element in $B$ (respectively
$A$) such that $b'_i \le a_i$ (respectively $a'_i \le b_i$). Note that RANSV is
obviously equivalent to merging in the PRAM model. We next establish a lower
bound for this problem.
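
As a point of reference for the definition, the following sequential Python sketch computes the two RANSV output arrays with binary search. It is not the network algorithm of this thesis; the function name `ransv` and the use of `None` for "no such element" are assumptions made for illustration.

```python
from bisect import bisect_right


def ransv(a, b):
    """Sequential reference for RANSV on two sorted lists a and b.

    Returns (a_out, b_out) where a_out[i] is the largest element of b that
    is <= a[i], and b_out[i] is the largest element of a that is <= b[i].
    None marks positions for which no such element exists.
    """
    def matches(xs, ys):
        out = []
        for x in xs:
            k = bisect_right(ys, x)     # number of elements of ys that are <= x
            out.append(ys[k - 1] if k > 0 else None)
        return out

    return matches(a, b), matches(b, a)


a_out, b_out = ransv([1, 4, 9], [2, 3, 10])
```

Each match is a rank query of one array against the other, which is why the problem is equivalent to merging on a PRAM.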

Lemma 3.5 Let $PT_0$ and $PT_1$ be an arbitrary partition of $p$ processors
such that $|PT_0| = |PT_1| = \frac{p}{2}$. The problem of solving RANSV
requires $\Omega(n)$ data exchanges between $PT_0$ and $PT_1$.


Proof: Let $PT_0$ and $PT_1$ be an arbitrary partition of the $p$ processors
such that $|PT_0| = |PT_1| = \frac{p}{2}$. Let $n_0^A$ and $n_1^A$ be the
numbers of elements of $A$ distributed in $PT_0$ and $PT_1$ respectively. We
can define $n_0^B$ and $n_1^B$ similarly. Note that $n_0^A = n_1^B$ and
$n_1^A = n_0^B$. Without loss of generality we assume that $n_0^A \ge n_1^A$.
Let $\{a_{j(1)}, \ldots, a_{j(n_0^A)}\}$ be elements from $A$ that are in
$PT_0$ and $\{b_{k(1)}, \ldots, b_{k(n_1^B)}\}$ be elements from $B$ that are
in $PT_1$, where $j(1) < \ldots < j(n_0^A)$ and $k(1) < \ldots < k(n_1^B)$. Let
$A$ and $B$ be such that $b_{k(i)}$ is the largest element that is
$\le a_{j(i)}$, $1 \le i \le n_0^A$.

Clearly, such arrays $A$ and $B$ exist and at least $\frac{n}{2}$ elements must
pass between $PT_0$ and $PT_1$ to find the matches since
$n_0^A \ge \frac{n}{2}$. □

Corollary 3.6 The ANSV problem requires $\Omega(\frac{n \log p}{p})$ time steps
on a $p$-processor butterfly, shuffle-exchange, or cube-connected-cycles, and
$\Omega(\frac{n}{p})$ time steps on the weak hypercube model. □

Proof: Let $\mathcal{A}$ be an algorithm of running time $T$ for solving ANSV
with the input memory map $\pi_{in}$ and the output memory map $\pi_{out}$. We
show how to solve RANSV in time $T$. Let $A$ and $B$ be the input arrays to
RANSV. Assume without loss of generality that all the elements of $A$ and $B$
are distinct. The input to ANSV will be the array $C = (A, B')$, where $B'$ is
the array $B$ given in non-increasing order. Use $\pi_{in}$ to store $C$ as
required by $\mathcal{A}$. Run algorithm $\mathcal{A}$ on the corresponding
input. The output map $\pi_{out}(j)$ determines the local memory containing the
left and right matches of the $j$th input. Let $c_j = a_i$ for some $i$. The
right match of $a_i$ is precisely the largest element of $B$ that is less than
or equal to $a_i$. Similarly for the case when $c_j = b_i$. Hence we can solve
RANSV in time $T$ using algorithm $\mathcal{A}$. □
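
The reduction in this proof can be checked on small inputs with a sequential sketch. The stack-based routine below is a textbook way to compute nearest smaller values and is not the network algorithm; the function names are illustrative, and the elements are assumed distinct as in the proof.

```python
def nearest_smaller_right(c):
    """For each c[j], the first element to its right that is smaller (or None)."""
    res = [None] * len(c)
    stack = []                          # indices still waiting for a smaller value
    for j, x in enumerate(c):
        while stack and c[stack[-1]] > x:
            res[stack.pop()] = x
        stack.append(j)
    return res


def nearest_smaller_left(c):
    """Mirror image of nearest_smaller_right."""
    return nearest_smaller_right(c[::-1])[::-1]


def ransv_via_ansv(a, b):
    """Solve RANSV for sorted, distinct a and b through ANSV on C = (A, B reversed)."""
    n = len(a)
    c = a + b[::-1]
    right = nearest_smaller_right(c)
    left = nearest_smaller_left(c)
    a_out = right[:n]                                # right match of each a[i]
    b_out = [left[n + (len(b) - 1 - i)] for i in range(len(b))]  # left match of each b[i]
    return a_out, b_out


result = ransv_via_ansv([1, 4, 9], [2, 3, 10])
```

Because the elements of A to the right of a given a[i] are all larger, the first smaller value to its right is found inside the reversed copy of B, and it is the largest element of B below a[i].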


Corollary 3.7 Given a monotone polygon $P$ with its $n$ sides sorted,
triangulating $P$ requires $\Omega(\frac{n}{p})$ time steps on the weak
hypercube and $\Omega(\frac{n \log p}{p})$ time steps on the butterfly, the
shuffle-exchange or the cube-connected cycles.


Proof: Let $A = (a_0, a_1, \ldots, a_{n-1})$ and
$B = (b_0, b_1, \ldots, b_{n-1})$ be two arrays sorted in nondecreasing order.
Without loss of generality, we assume that all the elements are distinct, and
that $a_0 < b_0$, $a_{n-1} > b_{n-1}$ and $a_0 > 0$. Let $l = a_{n-1}$. We
define the following monotone polygon $P$. The upper chain of the polygon is

[vertex list illegible in the source]

and the lower chain is

[vertex list illegible in the source]

Then by triangulating the monotone polygon, we can find for each $a_i$ ($b_i$)
the largest element in $B$ ($A$) that is $\le a_i$ ($\le b_i$) [7]. Therefore
the edges [illegible] and [illegible] have to be [illegible]. Using the same
argument as in Lemma 3.5, we conclude that there are some instances of this
problem requiring the specified amount of time on the networks. □

We now consider the lower bound of the other computational geometry problem,
the ANN problem. The problem RANSV1, which is similar to RANSV and useful in
proving the lower bound of the ANN problem, can be defined as follows: The
input consists of two arrays $A = (a_0, a_1, \ldots, a_{n-1})$ and
$B = (b_0, b_1, \ldots, b_{n-1})$, each sorted in non-decreasing order. The
output consists of two arrays $A' = (b'_0, b'_1, \ldots, b'_{n-1})$ and
$B' = (a'_0, a'_1, \ldots, a'_{n-1})$, where $b'_i$ (respectively $a'_i$) is
the element in $B$ (respectively $A$) such that $b'_i = a_i$ (respectively
$a'_i = b_i$), if it exists. We next establish a lower bound for this problem.
The proof of the following lemma is similar to that for RANSV and is omitted.

Lemma 3.6 Let $PT_0$ and $PT_1$ be an arbitrary partition of $p$ processors
such that $|PT_0| = |PT_1| = \frac{p}{2}$. The problem of solving RANSV1
requires $\Omega(n)$ data exchanges between $PT_0$ and $PT_1$. □

Corollary 3.8 Given a convex polygon $P$ with its $n$ sides sorted, solving the
ANN problem requires $\Omega(\frac{n}{p})$ time steps on the weak hypercube and
$\Omega(\frac{n \log p}{p})$ time steps on the butterfly, the shuffle-exchange
or the cube-connected cycles.

Proof: We now reduce problem RANSV1 to this problem. Let
$A = (a_0, a_1, \ldots, a_{n-1})$ and $B = (b_0, b_1, \ldots, b_{n-1})$ be two
arrays sorted in nondecreasing order. Without loss of generality, we assume
that $a_{i+1} - a_i > 1$ and $b_{i+1} - b_i > 1$, $0 \le i \le n-2$, and that
$a_0 < b_0$, $a_{n-1} > b_{n-1}$ and $a_0 > 0$. We define the following convex
polygon $P$. The upper chain of the polygon is

$((a_0, 0.5), (a_1, 0.5), \ldots, (a_{n-1}, 0.5))$

and the lower chain is

$((a_0, 0.5), (b_0, 0), (b_1, 0), \ldots, (b_{n-1}, 0), (a_{n-1}, 0.5))$.

If the nearest point of $(a_i, 0.5)$ is $(b_j, 0)$, for some
$0 \le i, j \le n-1$, and $a_i = b_j$, then $a_i$ and $b_j$ are the matching
elements in the arrays $A$ and $B$. Thus a solution to the ANN problem provides
a solution to the RANSV1 problem and the corollary follows. □

We now address the VLSI routing problems considered in the previous section.
We start with the river routing problem. The solution to the following problem
was used as the main building block to determine the minimum separation. Let
$B = (b_0, b_1, \ldots, b_{n-1})$ and $T = (t_0, t_1, \ldots, t_{n-1})$ be the
input arrays and $S = (t_{j(0)}, t_{j(1)}, \ldots, t_{j(n-1)})$ be the output
array as before.

Recall that the minimum separation is given by $\max_i\{i - j(i) + 1\} + 1$.
Let $t'_j = t_j - j - 1$ and $b'_i = b_i - i$. Then $j(i)$ can be defined as
$j(i) = \min_j\{j \le i \mid t'_j \ge b'_i\}$. This problem is of the same
flavor as RANSV. We can use similar techniques to show the following.

Lemma 3.7 Solving the above problem related to river routing requires
$\Omega(\frac{n \log p}{p})$ time on a $p$-processor butterfly,
shuffle-exchange, or cube-connected-cycles, and $\Omega(\frac{n}{p})$ time on
the weak hypercube model. □

We finally discuss the lower bound proof of the line packing problem. Let $n$
be the number of intervals. As before, we assume that the input and the output
are given in arrays $A = (a_0, \ldots, a_{2n-1})$, where $a_j$ is a triple
$(x_j, id(j), mark(j))$, and $B = (b_0, b_1, \ldots, b_{2n-1})$, respectively.
Recall that $x_0 < x_1 < \cdots < x_{2n-1}$.

Lemma 3.8 The line packing problem requires $\Omega(\frac{n \log p}{p})$ time
on a $p$-processor butterfly, shuffle-exchange, or cube-connected-cycles, and
$\Omega(\frac{n}{p})$ on the weak hypercube model.

Proof: As before let $PT_0$ and $PT_1$ be a partition of the processors. There
are $n$ left terminals and $n$ right terminals in $A$. Let
$A_0 = (a_0, \ldots, a_{n-1})$ and $A_1 = (a_n, \ldots, a_{2n-1})$. Let $n_0^0$
and $n_0^1$ be the numbers of elements of $A_0$ that are stored in $PT_0$ and
$PT_1$ respectively. $n_1^0$ and $n_1^1$ can be defined similarly. Note that
$n_0^0 = n_1^1$ and $n_0^1 = n_1^0$. Without loss of generality, we assume
$n_0^0 \ge n_0^1$ and $n_0^0$ is an even number. Then, we can construct an
instance that will require $\Omega(n)$ data exchanges. Let
$A_0^0 = (a_{j(1)}, \ldots, a_{j(n_0^0)})$ be the elements of $A_0$ in $PT_0$
such that $j(1) < \cdots < j(n_0^0)$. $A_0^1$, $A_1^0$ and $A_1^1$ can be
defined similarly. We construct $\frac{n_0^0}{2}$ intervals for $A_0^0$ as
follows: $j(i)$ and $j(\frac{n_0^0}{2} + i)$ define the left and right ends of
an interval, for $1 \le i \le \frac{n_0^0}{2}$. We construct
$\frac{n_1^1}{2}$ intervals for $A_1^1$ in the same way. We construct intervals
by letting the terminals in $A_0^1$ be left terminals and the terminals in
$A_1^0$ be right terminals, and by connecting the terminals one by one. Clearly
this is a valid instance for the problem. Then all the right terminals in
$A_0^0$ have their successor left terminals in $PT_1$ and the lemma follows. □


Chapter  4 


List  Ranking  and  Graph 
Algorithms 

4.1  Introduction 

The main goal of this chapter is to develop optimal network algorithms for
non-numeric problems such as list processing and graph-theoretic problems: list
ranking, tree expression evaluation, connected and biconnected components, ear
decomposition and st-numbering. Given a linked list, the list ranking problem
is to find the distance from each node to the end of the list. We present an
$O(\frac{n}{p})$ time optimal algorithm for the list ranking problem on the
pipelined hypercube, when $n = \Omega(p^{1+\epsilon})$ for any positive
constant $\epsilon$. This algorithm is used to develop optimal algorithms for
all the other problems on our networks. We also prove some lower bounds for
these problems on the weak hypercube, the shuffle-exchange and the
cube-connected cycles. All the algorithms utilize the basic results in the
previous chapters.

The rest of this chapter is organized as follows. Section 4.2 deals with the
list ranking problem. Several basic graph-theoretic problems are considered in
section 4.3.

4.2  List  Ranking 

Given a linked list $w_0, w_1, \ldots, w_{n-1}$ of $n$ items with $w_i$
following $w_{i-1}$ in the list and a binary associative operation $*$, the
parallel prefix problem is to compute all $n$ initial prefixes $w_0$,
$w_0 * w_1$, $\ldots$, $w_0 * w_1 * \cdots * w_{n-1}$ in parallel. An important
special case is the list ranking problem in which the distance from each node
to the end of the list is to be determined. In this section, an optimal
hypercube algorithm for the list ranking problem is presented. This is a
fundamental list processing problem which can be used to solve many
graph-theoretic problems. The parallel graph algorithms presented in the next
section will make a nontrivial

[M3] Find a maximum partition of $G$. A maximum partition of $G$ is a partition
$V_1$ and $V_2$ of $V$ such that the sum of the costs of the edges whose end
points are in different sets is no less than that of any other partition.

[M4] Let $V_1$, $V_2$ be a maximum partition of $V$. In the original list, mark
all the links $(x, y)$ to be in the independent set if $l(x) \in V_1$ and
$l(y) \in V_2$ (or vice versa).


We are ready to establish a couple of facts about the above algorithm.

Lemma 4.1 In step [M2], $c(i, i) = 0$ for each $i \in V$, and
$\sum_{i, j \in V} c(i, j) = n - 1$. □


In the graph $G$ defined by step [M2], let $V' \subseteq V$ and let
$CUT(V', V - V') = \sum_{i \in V', j \in V - V'} (c(i, j) + c(j, i))$. A
partition $V_1$ and $V_2$ of $V$ is maximal if for any $i \in V_1$,
$CUT(V_1, V_2) \ge CUT(V_1 - \{i\}, V_2 \cup \{i\})$ and for any $j \in V_2$,
$CUT(V_1, V_2) \ge CUT(V_1 \cup \{j\}, V_2 - \{j\})$.


Lemma 4.2 Let $V_1$ and $V_2$ be a maximal partition of $V$. Then
$CUT(V_1, V_2) \ge \frac{n-1}{2}$. Thus, the independent set in [M4] has no
less than $\frac{n-1}{4}$ links.

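Proof: For any $i \in V_1$, we have
$\sum_{j \in V_2} (c(i, j) + c(j, i)) \ge \sum_{j \in V_1} (c(i, j) + c(j, i))$,
because the partition is maximal. We have a similar inequality for each
$i \in V_2$. Thus,

$CUT(V_1, V_2) = \sum_{i \in V_1, j \in V_2} (c(i, j) + c(j, i))$

$\ge \sum_{i, j \in V_1} c(i, j) + \sum_{i, j \in V_2} c(i, j)$

$= n - 1 - CUT(V_1, V_2)$,

and $CUT(V_1, V_2) \ge \frac{n-1}{2}$. □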
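
The cut bound of this lemma can be observed with a small local-search sketch. This is not the exponential search used in step [M3]; it only produces some partition that no single-vertex move improves, which is all the lemma needs. The function names and the sample cost matrix are made up for illustration.

```python
def cut(c, v1, v2):
    """Total cost of edges crossing between v1 and v2, counting both directions."""
    return sum(c[i][j] + c[j][i] for i in v1 for j in v2)


def maximal_partition(c):
    """Move single vertices across the partition while the cut strictly grows."""
    n = len(c)
    v1, v2 = set(range(n)), set()
    improved = True
    while improved:
        improved = False
        for v in range(n):
            src, dst = (v1, v2) if v in v1 else (v2, v1)
            before = cut(c, v1, v2)
            src.remove(v)
            dst.add(v)
            if cut(c, v1, v2) > before:
                improved = True         # keep the move
            else:
                dst.remove(v)           # undo the move
                src.add(v)
    return v1, v2


# Hypothetical cost matrix with zero diagonal; the costs sum to 12.
costs = [[0, 2, 1, 0],
         [1, 0, 0, 3],
         [0, 1, 0, 1],
         [2, 0, 1, 0]]
part1, part2 = maximal_partition(costs)
total = sum(map(sum, costs))
```

The loop terminates because the cut value strictly increases with every accepted move, and at termination the partition satisfies the definition of maximal, so the cut is at least half of the total cost.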

We can easily check that the time needed to execute steps [M1], [M2] and
[M4] is $O(\frac{n}{p})$ when $n = \Omega(p^{1+\epsilon})$. Step [M3] can be done by a straightforward

exponential  algorithm  since  |F[  =  O(log*^*n)  and  hence  2^3  =  (9(log'^“’' p). 
Therefore  we  have  the  following. 

Lemma 4.3 When $n = \Omega(p^{1+\epsilon})$, algorithm INDEPENDENT SET finds an
independent set of no less than $\frac{n-1}{4}$ links in time $O(\frac{n}{p})$
on the pipelined hypercube. □

Corollary 4.1 When $n = \Omega(p^{1+\epsilon})$, algorithm INDEPENDENT SET
finds an independent set of no less than $\frac{n-1}{4}$ links in time
$O(\frac{n}{p} \log p)$ on the weak hypercube, the shuffle-exchange and the
cube-connected cycles. □




Corollary  4.2  For  any  n  >  p,  algorithm  INDEPENDENT  SET  finds  an 
independent  set  of  no  less  than  j  links  in  time  0{ "  -j-  log^  p)  on  the  pipelined 

hypercube  by  using  the  mergesort  algorithm  in  section  2.4-  □ 

We are ready to describe the list ranking algorithm. The well-known overall
strategy is to identify a large independent set, contract the links in this
set, and repeat this process until the length of the list is small enough. For
the remaining short list, we can use Wyllie's algorithm [87], which consists of
contracting the list $O(\log n)$ times by using the path doubling technique at
each iteration. The following procedure describes the overall strategy.

procedure LIST RANKING:

[P1] Execute steps [P2] - [P4] until the number of remaining nodes is no more
than $\frac{n}{\log n}$. $O(\log \log n)$ executions are necessary.

[P2] Find an independent set containing no less than one quarter of the links
by using algorithm INDEPENDENT SET.

[P3] Contract all the links in the independent set.

[P4] The size of the collapsed list is no more than roughly three quarters of
its previous size. The remaining links are distributed evenly by using the
BALANCE algorithm of section 2.3. The successor field of each link is modified
such that the redistributed links constitute a linked list.

[P5] Apply Wyllie's algorithm to find the list ranking value of the remaining
list.

[P6] We restore the linked list and compute the list ranking value of the
original list. This step is similar to step [P4].
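
The path doubling used in step [P5] can be sketched as a sequential simulation of the synchronous rounds. This is a minimal illustration, not the hypercube implementation; it assumes the list is given as a successor array in which the last node points to itself, and the name `wyllie_rank` is introduced here.

```python
def wyllie_rank(succ):
    """Distance from every node to the end of the list by pointer jumping.

    succ[i] is the successor of node i; the last node is its own successor.
    Each loop iteration simulates one synchronous parallel round.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Stop once every pointer has reached the last node.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # Both updates read the old arrays, as a parallel round would.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank


# The list 3 -> 0 -> 2 -> 1, stored as a successor array.
ranks = wyllie_rank([2, 1, 1, 0])
```

The number of rounds is logarithmic in the list length because the distance covered by each pointer doubles in every round.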


When $n = \Omega(p^{1+\epsilon})$, a single execution of steps [P2], [P3], [P4]
and [P6] can be done in time $O(\frac{n}{p})$ on the pipelined hypercube. Thus,
the total communication and computation time to execute these steps satisfies
the recurrence

$T_p(n) = T_p(\frac{3n}{4}) + O(\frac{n}{p})$

and hence $T_p(n) = O(\frac{n}{p})$. Step [P5] can be performed in time
$O(\frac{n}{p \log n} \log n) = O(\frac{n}{p})$. Therefore the overall time
complexity is $O(\frac{n}{p})$.
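
To see why a recurrence of this shape collapses to a linear bound, here is a short worked derivation. It assumes, as a labeled simplification, that each round leaves at most a fraction $\alpha < 1$ of the list (the independent-set bound suggests $\alpha = 3/4$) and that one round costs at most $c\,n/p$ for some constant $c$; both symbols are introduced here only for illustration.

```latex
\begin{align*}
T_p(n) &\le T_p(\alpha n) + c\,\frac{n}{p} \\
       &\le c\,\frac{n}{p} \sum_{i \ge 0} \alpha^{i} \\
       &= \frac{c}{1-\alpha} \cdot \frac{n}{p}
        = O\!\left(\frac{n}{p}\right),
       \qquad \text{e.g. } \frac{c}{1-\alpha} = 4c \text{ for } \alpha = \tfrac{3}{4}.
\end{align*}
```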

Theorem 4.1 When $n = \Omega(p^{1+\epsilon})$, the list ranking problem can be
solved on the pipelined hypercube in time $O(\frac{n}{p})$. □

Corollary 4.3 When $n = \Omega(p^{1+\epsilon})$, the list ranking problem can
be solved in time $O(\frac{n}{p} \log p)$ on the weak hypercube, the
shuffle-exchange and the cube-connected cycles. □

Corollary 4.4 For any $n \ge p$, the list ranking problem can be solved on the
pipelined hypercube in time $O(\frac{n \log n}{p} + \log^3 p)$ by using the
mergesort algorithm in section 2.4. □

A natural question is whether the weak hypercube algorithm can be improved. We
show next that this is not possible.

Lemma 4.4 The list ranking problem of $n$ links on the weak hypercube requires
$\Omega(\frac{n}{p} \log p)$ time, when the links are initially evenly
distributed over the $p$ processors.

Proof: Consider a routing problem similar to that of section 2.2 that moves
$\frac{n}{p}$ data items from $P_i$ to $P_{E(i)}$, $0 \le i \le p-1$, where
$E(i) = (i + 0101\ldots01_2) \bmod p$. This routing problem requires
$\Omega(\frac{n}{p} \log p)$ time on the weak hypercube. Now we reduce this
routing problem to the parallel prefix problem. For each data item in $P_i$, we
create a linked list with only one link, beginning at $P_i$, with the node
value being the data value to be moved, and ending in $P_{E(i)}$ with the node
value being 0 (the identity for operator $*$). Clearly a solution to this
parallel prefix problem will solve the given routing problem and hence the
lemma follows. □
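
The routing pattern used in this proof is easy to tabulate for a small hypercube. The sketch below only computes the destination of each processor and the Hamming distance each item has to travel; it makes no claim about routing time. The function names are illustrative and the number of address bits is assumed to be even.

```python
def routing_target(i: int, d: int) -> int:
    """E(i) = (i + 0101...01 in binary) mod p, for p = 2**d and even d."""
    if d % 2 != 0:
        raise ValueError("this sketch assumes an even number of address bits")
    offset = int("01" * (d // 2), 2)    # the d-bit pattern 0101...01
    return (i + offset) % (1 << d)


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


d = 8
p = 1 << d
distances = [hamming(i, routing_target(i, d)) for i in range(p)]
average_distance = sum(distances) / p   # inspect to see how far items travel
```

Since adding a fixed offset modulo p is a bijection, every processor sends to and receives from exactly one partner.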

We  can  also  prove  the  same  lower  bound  of  the  problem  on  the  shuffle- 
exchange  and  the  cube-connected  cycles  in  a  similar  way. 

4.3  Graph  Problems 

In this section, we describe parallel algorithms for several well-known graph
problems. The problems include tree expression evaluation, Euler tour on trees,
finding lowest common ancestors, connectivity, biconnectivity, strong
orientation, ear decomposition, Euler tour on graphs, graph coloring and
finding maximal independent sets (see Tables 1.1 and 1.2). These algorithms
utilize the algorithms for load balancing, integer sorting and list ranking of
sections 2.3, 2.4 and 4.2, respectively. We start by discussing those problems
for which efficient algorithms are developed. These include computing various
tree functions and planar graph problems. A common strategy that works well for
all these problems consists of reducing the size of the problem by using an
efficient hypercube algorithm and then applying a fast shared memory algorithm
on the reduced size problem.

Theorem 4.2 The following problems can be solved in time $O(\frac{n}{p})$ on
the pipelined hypercube, when $n = \Omega(p^{1+\epsilon})$.

1. Tree expression evaluation.

2. Euler tours on trees, and hence all the basic tree functions.

3. Maximal independent set, connected components, biconnected components,
strong orientation, ear decomposition, st-numbering, and Euler tour of planar
graphs.

Proof Sketch: All the algorithms used have been reported in the literature.
See Tables 4.1 and 4.2 for appropriate references. However, we use our routing
and list ranking algorithms to achieve the time bound stated in the theorem. □

Corollary 4.5 The problems in the above theorem can be solved in time
$O(\frac{n}{p} \log p)$ on the weak hypercube, the shuffle-exchange and the
cube-connected cycles, when $n = \Omega(p^{1+\epsilon})$.

Corollary 4.6 The problems in the above theorem can be solved in time
$O(\frac{n \log n}{p} + \log^3 p)$ on the pipelined hypercube, for any
$n \ge p$.

We now consider the important problem of finding the connected components of
arbitrary graphs. There are two well-known shared memory algorithms: an
$O(\frac{n^2}{p} + \log^2 n)$ algorithm when the input is given as an adjacency
matrix [11,84], and an $O(\frac{m+n}{p} \log n)$ algorithm when the input is
given as an edge list [74] with $n$ vertices and $m$ edges. We call the latter
one the “SV-algorithm”.

It can be easily verified that the above two algorithms can be implemented
directly on the pipelined hypercube in time $O(\frac{n^2}{p})$ and
$O(\frac{m+n}{p} \log n)$ when $n^2 \ge p^{1+\epsilon}$ and
$n + m \ge p^{1+\epsilon}$, respectively, by using our previous algorithms.

However,  if  the  adjacency  matrix  of  a  graph  is  given  as  input,  there  is  an 
efficient  algorithm  even  for  the  weak  hypercube,  the  shuffle-exchange  and  the 
cube-connected  cycles  [1]. 

Lemma  4.5  Given  the  adjacency  matrix  of  a  graph,  the  connected  components 
can  be  found  in  time  0{^)  on  the  weak  hypercube,  the  shuffle-exchange  and  the 

2 

cube-connected  cycles,  when  p  < 

We now consider the case of sparse graphs. A faster algorithm is known and is
efficient for all graphs except those which are extremely sparse [17]. Using
the strategy of [17], we derive a simple algorithm for finding the connected
components of a graph in time $O(\frac{m+n}{p} \log \log n)$ on the pipelined
hypercube, where the input is given as an edge list and
$n + m \ge p^{1+\epsilon}$. Since we will be using the SV-algorithm in our
description, we present a brief outline of the main strategy.

The output of the algorithm is a vector $D[1 \ldots n]$ such that
$D(u) = D(v)$ if and only if vertices $u$ and $v$ are in the same connected
component. Initially, $D(v) = v$, for each $v = 1, \ldots, n$. Notice that the
vector $D$ forms a forest during the execution of the algorithm. The algorithm
performs the following steps for $O(\log n)$ iterations:

1. Shortcutting. Set $D(i) = D(D(i))$, $1 \le i \le n$.

2. Hooking trees onto smaller vertices of other trees. If $D(i)$ did not change
in step 1, then if there exists $j$ such that $(i, j)$ is an edge of $G$ and
$D(j) < D(i)$, then set $D(D(i)) = D(j)$.

3. If there is a tree that did not change in steps 1 and 2, then hook such a
tree onto another tree if possible.
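
The interplay of hooking and shortcutting on the vector D can be tried out with the following sequential sketch. It is a deliberately simplified variant, not the SV-algorithm itself: it omits step 3 and the "did not change" test, processes the edges one after another, and shortcuts until every tree is a star. Vertices are numbered from 0 here, and the function name is illustrative.

```python
def components_by_hooking(n, edges):
    """Simplified hook-and-shortcut labelling of connected components.

    Returns a list d with d[u] == d[v] exactly when u and v are connected;
    each label is the smallest vertex of its component.
    """
    d = list(range(n))                  # every vertex starts as its own root
    changed = True
    while changed:
        changed = False
        # Hooking: attach the root with the larger label below the smaller one.
        for u, v in edges:
            ru, rv = d[u], d[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if d[hi] == hi:         # only a current root may be hooked
                    d[hi] = lo
                    changed = True
        # Shortcutting: pointer jumping until all trees have height one.
        while any(d[d[i]] != d[i] for i in range(n)):
            d = [d[d[i]] for i in range(n)]
    return d


labels = components_by_hooking(6, [(0, 1), (1, 2), (3, 4)])
```

Pointers always lead to smaller labels, so no cycles can form, and every pass that changes anything removes at least one root, which guarantees termination.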

We now begin the description of our algorithm. At each step of the algorithm, a
set of vertices with the same $D$ value is referred to as a supervertex. Each
edge $(u, v)$ in the input graph induces an edge connecting the supervertex
containing $u$ with the supervertex containing $v$. The graph whose vertices
are the supervertices and whose edges are these induced edges is called the
supervertex graph. Given a supervertex graph, an edge in $G$ is redundant (with
respect to the supervertex graph) if both of its endpoints lie in the same
supervertex. An edge is an outedge if it is not redundant. If several outedges
connect the same pair of supervertices, one of these outedges is chosen to be
the actual outedge; the other outedges are called duplicate outedges. The rule
for choosing the actual edge is arbitrary. The degree of a supervertex $v$,
$degree(v)$, is defined to be the number of actual outedges incident on $v$.
Initially, the input graph $G$ is the supervertex graph in which $V$ is the set
of supervertices and $E$ is the set of actual outedges.
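
The edge classification just defined can be made concrete with a small sequential sketch. The function name, the "first edge seen wins" rule for picking the actual outedge, and the sample data are assumptions made for illustration; the text only requires that the rule be arbitrary.

```python
def classify_edges(edges, d):
    """Split edges into redundant edges, actual outedges and duplicate outedges.

    d[v] is the supervertex label of vertex v. Returns the three groups plus
    the degree of each supervertex, counted over actual outedges only.
    """
    redundant, duplicate = [], []
    actual = {}                         # supervertex pair -> chosen edge
    for u, v in edges:
        su, sv = d[u], d[v]
        if su == sv:
            redundant.append((u, v))    # both endpoints in one supervertex
            continue
        key = (min(su, sv), max(su, sv))
        if key in actual:
            duplicate.append((u, v))
        else:
            actual[key] = (u, v)        # arbitrary rule: first edge seen wins
    degree = {}
    for a, b in actual:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return redundant, actual, duplicate, degree


# Three supervertices: {0, 1}, {2, 3} and {4}.
red, act, dup, deg = classify_edges([(0, 1), (1, 2), (0, 3), (3, 4)],
                                    [0, 0, 2, 2, 4])
```

Step [C5] of the procedure below corresponds to discarding the first and third groups and keeping only the actual outedges.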

procedure CONNECTIVITY:

[C1] Execute steps [C2] - [C5] until the numbers of remaining vertices and
edges are no more than $\frac{n}{\log n}$ and $\frac{m}{\log n}$ respectively.
For the $i$th iteration, let $n_i$ be the number of vertices in the
non-isolated supervertex containing the fewest number of vertices. Then we set
$d = \sqrt{n_i}$.

[C2] For each supervertex $v$, select exactly $d$ actual outedges of $v$ if
$degree(v) > d$, or select all its actual outedges otherwise.

[C3] Run the SV algorithm for $\lceil \log_{3/2} d \rceil + 1$ iterations on
the graph induced by the edges selected in step [C2].

[C4] The output of step [C3] is a rooted forest of its supervertices. Contract
this rooted forest into rooted stars (rooted trees of height 1).

[C5] Construct the new supervertex graph for the next iteration. To do this,
modify vector $D$ and delete all redundant edges, duplicate outedges and
isolated supervertices.

[C6] For the supervertex graph with vertices and edges no more than
$\frac{n}{\log n}$ and $\frac{m}{\log n}$ respectively, run the SV algorithm to
find its connected components. Finally, modify vector $D$ to represent the
connectivity of the input graph $G$.

Lemma  4.6  For  each  iteration,  step  [C3]  can  be  executed  in  time  on 

the  pipelined  hypercube  when  m  +  n  = 

Proof:  The  number  of  edges  of  the  f-th  iteration  \s  0(-^  ■  d)  =  0{^).  Since 
the  number  of  edges  for  the  SV  algorithm  is  0{  ),  the  time  for  step  [C3]  is 

Lemma 4.7 After $O(\log \log n)$ iterations of steps [C2] - [C5], there is no
non-isolated supervertex.

Proof:  In  the  graph  comprising  the  supervertices  and  the  edges  selected  in  step 
[C2],  there  are  two  kinds  of  components: 

1.  Components  with  more  than  d  supervertices. 

2.  Components  with  no  more  than  d  supervertices. 

Supervertices of the components of type 2 belong to the same rooted tree after
step [C3] and are removed as an isolated supervertex in step [C5]. Supervertices
of the components of type 1 are partitioned into one or more rooted trees, each

3. 

containing  at  least  d  supervertices  after  step  [C3].  Thus,  n,+i  >  n,  -d  =  Since 
>  ^2^^  >  2*?*'"^,  after  +  2  iterations,  there  can  be  no  non-isolated 

supervertex.  □ 

Theorem 4.3 When $m + n = \Omega(p^{1+\epsilon})$, the connected components of
a graph can be found in time $O(\frac{m+n}{p} \log \log n)$ on the pipelined
hypercube.

Proof: Steps [C2] and [C5] can be done in time $O(\frac{m+n}{p})$ by using the
routing algorithm of section 2.5. Step [C4] can be done within the same time
bound with the Euler tour technique and list ranking. Step [C6] can be done in
time $O(\frac{m+n}{p \log n} \log n) = O(\frac{m+n}{p})$. Thus the total time
for this algorithm is $O(\frac{m+n}{p} \log \log n)$. □

Corollary 4.7 When $m + n = \Omega(p^{1+\epsilon})$, the connected components
of a graph can be found in time $O(\frac{(m+n) \log p}{p} \log \log n)$ on the
weak hypercube, the shuffle-exchange and the cube-connected cycles.

Corollary 4.8 For any $m, n \ge p$, the connected components of a graph can be
found in time $O((\frac{(m+n) \log n}{p} + \log^3 p) \log \log n)$ on the
pipelined hypercube.

One can easily show that the following graph problems can be solved within the
same time bounds on the hypercube as the connected components problem (for both
forms of the input): biconnected components, strong orientation, ear
decomposition, and st-numbering.

We now prove a lower bound result for the weak hypercube, the shuffle-exchange
and the cube-connected cycles. The lower bound shown for all the above graph
problems is $\Omega(\frac{n}{p} \log p)$. We also present a result showing that
it is possible to find the connected components faster than
$O(\frac{m+n}{p} \log p)$ in some cases on the weak hypercube.

Lemma 4.8 Solving any graph problem mentioned in this section requires
$\Omega(\frac{n}{p} \log p)$ time on the weak hypercube, the shuffle-exchange
and the cube-connected cycles.

Proof: We prove this lemma only for the graph connectivity problem. The proofs
for the other problems are similar. Assume without loss of generality that
$p = 2^d = 2^{3k}$. Let $E_1(i) = (i + 001001\ldots001_2) \bmod p$,
$E_2(i) = (i + 010010\ldots010_2) \bmod p$ and
$E_3(i) = (i + 011011\ldots011_2) \bmod p$, $0 \le i \le p-1$. Then we can
prove by induction that the Hamming distance of any two of
$[i, E_1(i), E_2(i), E_3(i)]$ is $\Omega(\log p)$. Consider the following
instance: $n = m = 4l$ for some integer $l$ and each connected component is a
cycle $(v_{j0}, v_{j1}, v_{j2}, v_{j3})$ of length 4, $0 \le j \le l-1$. If we
assume that the two edges incident on $v_{j0}$ are in $P_i$ and those incident
on $v_{j1}$ in $P_{E_1(i)}$ and so on, then the time it takes to label the
vertices of each connected component with the same label is $\Omega(\log p)$
and the total time is therefore $\Omega(\frac{n}{p} \log p)$. The lower bound
for the shuffle-exchange and the cube-connected cycles can be proved
similarly. □

Lemma 4.9 The connected components of a graph $G = (V, E)$, where $|V| = n$ and
$|E| = m$, can be obtained in
$O((k \frac{m}{p} + (k+1) \frac{n}{\log^k p}) \log \log n \log \log p)$ time,
for any positive integer constant $k$, when $m = \Omega(p \log p)$, on the weak
hypercube, the shuffle-exchange and the cube-connected cycles. When
$n \le \frac{m \log^k p}{p}$, the time complexity is
$O(k \frac{m}{p} \log \log n \log \log p)$.

Proof:  The  connected  components  of  a  graph  G  can  be  obtained  in 
0(y  loglogn  logp)  on  the  weak  hypercube,  the  shuffle-exchange  and  the  cube- 
connected  cycles  if  m  >  p''^'.  With  this  fact,  the  following  algorithm  finds  the 
connected  components  within  the  time  bound  claimed  in  the  statement  of  the 
lemma. 

(1) For each subcube of size log^k p, find the connected components. Then each processor has to hold only edges. This step takes

O(k y log log n log log p) time.

(2) For each subcube of size log^(k+1) p, find the corresponding connected components. Then each processor has to hold edges. The execution

time of this step is O((k + 1) - log log n log log p).

(3)  Find  the  connected  components  on  all  processors.  This  step  takes 

For example, if p = √n and m > then the running time of the above algorithm is O(^ (log log n)^2).

Chapter 5


One-Layer  Routing 


5.1  Introduction 

It is well-known that many of the optimization problems arising in VLSI routing are NP-complete [41,16,67,76]. One notable exception is the class of one-layer routing problems associated with a hierarchical layout strategy such as Bristle-Blocks [36]. See [13,20,48,49,51,53,59,72,79] for more examples. Efficient serial solutions have already appeared in the literature for most of those problems. For parallel solutions, efficient algorithms that run in time O(log n) on CREW PRAM and in time O(  on Common CRCW PRAM were developed for several one-layer routing problems [10]. In this chapter, fast parallel algorithms for the one-layer routing problems on the hypercube, the shuffle-exchange, the cube-connected cycles and the butterfly are presented.

The class of general one-layer routing problems involves routing between ordered sequences of terminals such that the final layout is planar. One such problem (river routing) is the wiring of two ordered sets of terminals {b_1, b_2, ..., b_n} and {t_1, t_2, ..., t_n} across a channel between the parallel boundaries of two rectangles. The width of the channel is the vertical distance between the two lines forming the channel. The separation problem is to find the minimum width of the channel necessary to wire all nets such that any two wires are separated by a unit distance. We will restrict ourselves to the case where the wires are rectilinear, i.e., there is a grid structure such that each wire consists of a connected set of grid line segments. This problem can be solved in time O(n/p + log p) on the pipelined hypercube.

A more general version of the river routing problem that is known to have an efficient serial algorithm is to perform planar routing where the ports lie on the boundary of a simple rectilinear polygon [59]. In this case, we are interested in whether the routing is possible or not and, in the affirmative, we have to provide the detailed routing. Several interesting subproblems such as finding the contour of the union of a set of rectilinear polygons or determining whether a set of nets can be wired within a set of “passages” are also tackled. All these problems can be solved in time O(n/p) on the pipelined hypercube, when n = p^(1+ε) for any ε > 0. These algorithms can also be implemented in time O(  + log"' p) on the pipelined hypercube.

The  rest  of  this  chapter  is  organized  as  follows.  Section  5.2  deals  with  the 
separation  problem.  Sections  5.3  and  5.4  deal  with  the  routing  problem  and  the 
routability  testing  problem  within  a  rectilinear  polygon. 


5.2  The  Separation  Problem 

Let {N_i = <b_i, t_i> | 1 ≤ i ≤ n} be an instance of the channel separation problem. Notice that b_i and t_i will also be used to denote the horizontal coordinates of the terminals relative to an arbitrary origin. We make the reasonable assumption that all terminals lie in an interval [0, N], where N = O(n).

A net N_i is a right net if b_i < t_i, it is a left net if b_i > t_i, and it is a vertical net otherwise. We can partition the nets into right blocks, left blocks and vertical blocks. A set of right nets N_i, N_{i+1}, ..., N_k is a right block if it is maximal with the property b_j < b_{j+1} <  for i ≤ j < k. We can similarly define left blocks and vertical blocks.

The wiring problem is reduced to wiring each block separately. We will concentrate on the wiring of right blocks. Obvious changes can be made to deduce the corresponding algorithm for left blocks. An efficient strategy for right blocks consists of wiring the nets from left to right such that for each net we move from the bottom terminal upward and try to stay as close to the upper row as possible [20,59,72]. The wiring of a net can be specified by the coordinates of its bend points. For example, net N_1 of Figure 5.1 has the bend points A_{11}, ... . For each net N_i, we have 2k bend points, A_{i1}, A_{i2}, ..., A_{ik} and B_{i1}, B_{i2}, ..., B_{ik}, for some k. Not all of these bend points are needed to determine the overall wiring. We call A_{i1} and B_{i1} (bend points closest to the bottom row) the characteristic bend points of N_i. Notice that the characteristic bend points uniquely define the overall wiring since once we have the wiring of N_{i-1} and the characteristic bend points A_{i1} and B_{i1} of N_i, we can easily determine all the other bend points of N_i. Figure 5.1 shows an example of a river routing problem and a wiring achieving the minimum separation. The algorithm to find the minimum separation is based on the following lemma.

Lemma 5.1 Let N_i be a net in a right block and let j be the minimum j ≤ i such that t_j + (i - j - 1) ≥ b_i. Then the coordinates of the characteristic bend points of N_i are A_{i1} = (b_i, i - j + 1) and B_{i1} = (t_j + i - j, i - j + 1).

Proof: Since j is the minimum j ≤ i such that t_j + (i - j - 1) ≥ b_i, there is no terminal point at (t_j - 1, 0). Hence there is a bend point at (t_j, 1) for net N_j. The number of vertical grid lines between t_j and b_i is b_i - t_j and hence smaller than the number of nets between N_j and N_i, i.e., i - j + 1 horizontal tracks are needed to route net N_i. A simple argument will show that the coordinates of the characteristic bend points have the values stated in the lemma. □

Figure 5.1: Basic river routing problem

The following procedure computes such an index j(i) for each net N_i of a right block {N_i | 1 ≤ i ≤ n}.

procedure Index

1. Compute b'_i = b_i - i and t'_k = t_k - k - 1 for each i and k. Notice that b'_1 ≤ b'_2 ≤ ... ≤ b'_n and t'_1 ≤ t'_2 ≤ ... ≤ t'_n.

2. Merge the two sequences. If b'_i = t'_k, then put b'_i before t'_k in the merged sequence.

3. For each b'_i, find the nearest t'_k to the right in the merged sequence. Then j(i) = k.

The correctness proof of procedure Index is straightforward. We can use the merge algorithm in subsection 2.4.1 for step 2. Step 3 can be done by performing the prefix sums operation. Thus procedure Index can be implemented in time O(n/p + log p) on the pipelined hypercube. We now give the algorithm to find the minimum separation as well as the characteristic bend points of all the nets.

procedure  Separation 

1.  Partition  the  nets  into  blocks. 

2. Apply algorithm Index to get the index j(i) for each net N_i. Use Lemma 5.1 to obtain all the characteristic bend points.

3. Let the characteristic bend points be B_{i1} = (x_{i1}, y_{i1}), 1 ≤ i ≤ n. Then the minimum separation is max{y_{11}, ..., y_{n1}} + 1.
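As an illustration of steps 2 and 3 of procedure Separation, the sketch below computes the characteristic bend points of Lemma 5.1 and the resulting separation for a single right block. It is sequential and uses a binary search in place of the merge; the function names are ours, and the input is assumed to satisfy b_i < t_i with both sequences strictly increasing integers.

```python
from bisect import bisect_left


def characteristic_bends(b, t):
    """Return [(A_i1, B_i1)] for the nets of one right block, using Lemma 5.1.

    Assumption: b and t are strictly increasing integer lists with b[i] < t[i].
    """
    n = len(b)
    # t'_k = t_k - k - 1 is nondecreasing, so it can be searched directly.
    t_key = [t[k] - (k + 1) - 1 for k in range(n)]
    bends = []
    for i in range(1, n + 1):  # 1-based net index, as in the lemma
        # j(i): smallest j with t'_j >= b'_i, where b'_i = b_i - i.
        j = bisect_left(t_key, b[i - 1] - i) + 1
        height = i - j + 1
        a_point = (b[i - 1], height)
        b_point = (t[j - 1] + i - j, height)
        bends.append((a_point, b_point))
    return bends


def min_separation(b, t):
    """Largest characteristic bend height plus one (step 3 of Separation)."""
    return max(a_point[1] for a_point, _ in characteristic_bends(b, t)) + 1


if __name__ == "__main__":
    print(characteristic_bends([1, 2, 3], [4, 5, 6]))
    print(min_separation([1, 2, 3], [4, 5, 6]))
```

For three nets that all shift right by three columns, each net needs its own track, so the bend heights are 1, 2 and 3 and the separation is 4. For a complete instance the same computation would be applied to every block, with the mirrored version for left blocks.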

Theorem 5.1 Algorithm Separation finds the characteristic bend points of the n input nets and the minimum channel separation in time O(n/p + log p) on the pipelined hypercube. □

Corollary  5.1  Algorithm  Separation  finds  the  characteristic  bend  points  of 
the  n  input  nets  and  the  minimum  channel  separation  in  time  0(  on  the 

weak  hypercube,  the  shuffle-exchange,  the  cube-connected  cycles  and  the  butterfly. 

□ 


5.3  Routing  In  a  Simple  Polygon 

The routing problem of nets within a simple rectilinear polygon introduced in [59] is a generalization of the standard river routing problem. In this case we are supposed to connect a set of terminals {a_1, a_2, ..., a_n} on the boundary of a simple rectilinear polygon to another set of terminals {b_1, b_2, ..., b_n} on the boundary of the same polygon such that all the wires lie within the polygon and no two wires intersect. Routability testing is to determine whether or not a one-layer routing is possible, and detailed routing is to specify the actual wiring of the n nets, if they are routable. We will start by discussing a version of the detailed routing problem whose solution will be used in the routability testing algorithm. Routability testing will be discussed in the next section. We will restrict ourselves to the rectangle case. However, all the algorithms can be generalized to any rectilinear polygon. We assume that the x and y coordinates of the terminals are integers which lie in an interval [0, N], where N = O(n).

Figure 5.2: Basic river routing around a rectangle boundary

We will begin with a few definitions. Let {N_i = <a_i, b_i> | 1 ≤ i ≤ n} be the set of n input nets whose terminals lie on the boundary of a rectangle R. Let the lower left corner of R be (0,0), the origin of an (x, y) coordinate system. The four corners of R have coordinates (0,0), (l, 0), (l, h) and (0, h), where l and h are respectively the length and the height of R. If we cut R at (0,0) and straighten the boundary counterclockwise into a line, the corresponding linear coordinate of a point w on the boundary will be denoted by d(w). It is trivial to compute d(w) from the two-dimensional coordinate of w.
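For concreteness, one way to compute d(w) is sketched below. The function name `boundary_coordinate` and its parameter names are ours. The only convention assumed is that each corner is assigned to the side that is reached first when walking counterclockwise from (0,0).

```python
def boundary_coordinate(x, y, length, height):
    """Linear coordinate d(w) of a boundary point w = (x, y) of the rectangle
    with corners (0,0), (length,0), (length,height), (0,height), measured
    counterclockwise from the cut at (0,0)."""
    if not (0 <= x <= length and 0 <= y <= height):
        raise ValueError("point lies outside the rectangle")
    if y == 0:                      # bottom side, walked left to right
        return x
    if x == length:                 # right side, walked bottom to top
        return length + y
    if y == height:                 # top side, walked right to left
        return length + height + (length - x)
    if x == 0:                      # left side, walked top to bottom
        return 2 * length + height + (height - y)
    raise ValueError("point is not on the boundary")


if __name__ == "__main__":
    # A 4 x 3 rectangle has perimeter 14.
    for w in [(0, 0), (4, 0), (4, 3), (0, 3), (0, 1)]:
        print(w, boundary_coordinate(*w, length=4, height=3))
```

Each terminal needs only a constant number of comparisons, so in the parallel setting this is a purely local computation on every processor.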

Let N_i = <a_i, b_i> be an arbitrary net. The terminals a_i and b_i divide the boundary of R into two parts. The part of length ≤ h + l will be called the internal boundary of N_i. The other part will be called the external boundary. We assume without loss of generality that the internal boundary of N_i begins with a_i and ends with b_i counterclockwise. We call a_i and b_i the left terminal and the right terminal of N_i, respectively. Without loss of generality, we can also assume that d(a_1) < d(a_2) < ... < d(a_n).

A net N_i covers another net N_j if the internal boundary of N_i properly contains that of N_j. A representative net is a net that is not covered by any other net. Figure 5.2 shows an example of a detailed routing problem such that N_1, N_6 and N_14 are the representative nets. We can partition the nets into groups such that each group consists of a representative net and all the nets covered by it. Notice that the nets in each group appear consecutively in a circular fashion in R. The groups in Figure 5.2 are {N_1, N_2, N_3, N_4, N_5}, {N_6, N_7, N_8, N_9, N_10, N_11, N_12, N_13}, and {N_14, N_15}.

Clearly, if a given instance of the above problem is routable, then its routing can be performed by routing each group of nets separately. Thus, the general strategy for specifying the routing will be the following: (i) identify the representative nets, (ii) partition the nets into groups of nets, and (iii) specify the routing of the representative net of each group. The following algorithm handles (i) and (ii).

procedure Representative Nets

1. Mark all the nets whose internal boundaries contain the corner (0,0). If no such net exists, go to step 3.

2. Let N_{i_1}, N_{i_2}, ..., N_{i_k} be all the nets that are marked and d(a_{i_1}) < d(a_{i_2}) < ... < d(a_{i_k}). Then, clearly N_{i_1} is a representative net and it covers all the other k - 1 nets. Find all the nets covered by N_{i_1} and remove them with N_{i_1} from the input.

3. Sort all the remaining terminals according to their d-values. This step can be done by using the cubesort or the mergesort algorithm in chapter 2.

4. Assign +1 to the left terminal and -1 to the right terminal of each net and compute prefix sums of all the terminals. Clearly, N_i is a representative net if and only if the prefix sum value of a_i is 1.

5. For each representative net, find all the nets that it covers. Notice that if N_i and N_j are two adjacent representative nets such that d(a_i) < d(a_j), then nets N_{i+1}, ..., N_{j-1} are covered by N_i.
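Steps 3 to 5 of the procedure can be sketched sequentially as follows. The sketch assumes that steps 1 and 2 have already removed every net whose internal boundary contains the corner (0,0), so that each net is an interval (d(a_i), d(b_i)) with d(a_i) < d(b_i), and that the nets are properly nested with distinct terminals. The name `group_nets` is ours, and Python's built-in sort stands in for cubesort or mergesort.

```python
def group_nets(nets):
    """nets: list of (d(a_i), d(b_i)) pairs with d(a_i) < d(b_i).

    Returns (representatives, groups): the indices of the representative
    nets in order of their left terminals, and a dict mapping each
    representative to the indices of the nets in its group.
    """
    # Step 3: sort all terminals by their d-values.
    events = sorted(
        [(a, +1, i) for i, (a, _) in enumerate(nets)]
        + [(b, -1, i) for i, (_, b) in enumerate(nets)]
    )
    representatives = []
    groups = {}
    running = 0       # prefix sum of the +1 / -1 labels (step 4)
    current = None    # representative of the group being scanned (step 5)
    for _, sign, i in events:
        running += sign
        if sign == +1:
            if running == 1:          # prefix sum 1 at a left terminal
                current = i
                representatives.append(i)
                groups[current] = []
            groups[current].append(i)
    return representatives, groups


if __name__ == "__main__":
    example = [(0, 10), (1, 4), (2, 3), (5, 6), (11, 14), (12, 13)]
    print(group_nets(example))
```

In this example the nets with indices 0 and 4 are the representatives, and the remaining nets fall into their groups in the order in which their left terminals are met.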

Lemma 5.2 Let n be the number of input nets. The representative nets and the corresponding groups can be found in time O(n/p) on the pipelined hypercube, when n = p^(1+ε) for any ε > 0. □

We  now  turn  to  the  problem  that  routes  each  group  separately.  Our  goal  here 
is  to  identify  the  bend  points  of  each  representative  net.  Note  that  in  general 
the  total  number  of  bend  points  of  all  the  nets  could  be  However  the 

total number of bend points of the representative nets is always O(n).

Lemma 5.3 Let N_{r_1}, ..., N_{r_k} be all the representative nets and let l(N_{r_i}) be the number of nets in the internal boundary of N_{r_i}. Then Σ_{i=1}^{k} (l(N_{r_i}) + 1) = n. Moreover, there exists a wiring strategy such that N_{r_i} has at most 4(l(N_{r_i}) + 1) bend points. Thus, the total number of bend points of all the representative nets is O(n). □

Without loss of generality, we can assume that there is no net whose internal boundary contains the corner (0,0). If there is such a net, then we can consider the group consisting of such nets separately. The overall strategy for specifying the routing of the representative nets is as follows: (i) unfold R into the line L of length 2l + 2h by cutting at (0,0), (ii) specify the routing of the representative nets on L, (iii) restore R by cutting and folding L at the corners, and (iv) remove unnecessary line segments. The step-by-step illustration of this strategy applied to the instance of Figure 5.2 is given in Figure 5.3.

Figure 5.4: The union of the bounding perimeters. B_i is the bounding perimeter of N_i

For each net N_i = <a_i, b_i> on R, let N'_i = <d(a_i), d(b_i)> = <a'_i, b'_i> be the corresponding net on L. Let the rank of net N'_i, rank(N'_i), be the number of nets that cover it. Then the bounding perimeter of N'_i is the region that starts at a'_i - k and ends at b'_i + k, and whose height is k + 1, where k = rank(N'_i). Notice that the wiring of the representative net cannot intersect the inside of the bounding perimeter. Figure 5.4 shows the contour of nets N_1, N_2, N_3, N_4 and N_5 of Figure 5.2. We claim the following.

Lemma 5.4 The union of all the bounding perimeters of all the nets within a group determines the contour of the group and hence determines the wiring of the representative net.

Proof: Notice that no portion of the wiring of the representative net can lie inside any of the bounding perimeters. Therefore it has to lie either on the contour of the union of the bounding perimeters or it will have some portions outside the union. We will show that the region determined by the union of the bounding perimeters is large enough for routing all the nets in the group.

The proof is by induction on the number n of the nets. The case of n = 1 is trivial. Suppose a group G has n + 1 nets. If we remove the representative net from G, we may obtain several groups {G_i}. By the induction hypothesis, the union of the bounding perimeters of each group determines the corresponding contour. Since the rank of each net will increase by 1 if we put the representative net back, the union of bounding perimeters in G will stretch by a distance of 1, which is just enough to wire the representative net. □

It is clear that the rank of each net can be determined by a similar strategy as in steps 3 and 4 of procedure Representative Nets. Moreover it is easy to specify the bounding perimeter of each net once its rank is determined. We now discuss the problem of determining the contours of groups of nets from the corresponding bounding perimeters. Let two nets N'_i = <a'_i, b'_i> and N'_j = <a'_j, b'_j> be such that rank(N'_i) = rank(N'_j) = k and b'_i < a'_j. Then the bounding perimeters of N'_i and N'_j overlap if and only if a'_j - b'_i < 2k. Our initial task is to combine overlapping bounding perimeters of the same height.

Lemma 5.5 Let N'_j be the nearest net to the right of N'_i such that rank(N'_i) = rank(N'_j) = k. Then,

1. the ranks of all the nets whose terminals are between b'_i and a'_j are less than k,

2. there are an even number of terminals between them, with half of them being left terminals and half of them being right terminals, and

3. if the bounding perimeters of N'_i and N'_j overlap, then the bounding perimeters of corresponding pairs of nets whose terminals lie between b'_i and a'_j overlap. □

procedure Net Modification

1. Sort the terminals. We use the rank information of each terminal.

2. For each net N'_i, find the nearest net N'_j to the right such that rank(N'_i) = rank(N'_j) if it exists.

3. If such a net exists and a'_j - b'_i < 2k, then remove terminals b'_i and a'_j from the sorted list.

4. For each net N'_i whose b'_i is removed and whose a'_i is not removed, find the nearest N'_j to the right such that a'_j is removed and b'_j is not removed, and rank(N'_i) = rank(N'_j). Make a net consisting of the two terminals a'_i and b'_j.


Clearly, the contours of the modified nets are the same as those of the original nets. Notice that steps 2 and 4 can be implemented by the integer sorting algorithms of chapter 2 or the ANSV algorithm of chapter 3.

Lemma 5.6 The input nets can be modified as described above in time O(n/p) on the pipelined hypercube, when n = p^(1+ε) for any ε > 0. □

Now  we  are  ready  to  give  the  procedure  to  determine  the  contours  of  the 
different  groups. 

procedure  Contours  on  L 

1.  Modify  input  nets  by  using  procedure  Net  Modification. 

2. For each modified net N'_i = <a'_i, b'_i> of rank k, produce two points (a'_i - k, k + 1) and (b'_i + k, k + 1).

3.  Notice  that  several  points  from  left  (or  right)  terminals  may  have  the  same 
x-value.  However,  all  such  points  come  from  consecutive  nets  and  they  can 
be  identified  easily.  Remove  all  such  points  except  the  highest  point. 

4.  Construct  the  contours  with  the  resulting  set  of  points. 
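The effect of the procedure can be illustrated by a brute-force sequential sketch that builds the contour directly as the upper envelope of the bounding perimeters, which is what Lemma 5.4 calls for. It skips Net Modification entirely and scans unit cells of L, so it is meant only to show the shape of the output and not the parallel method. The name `contour_on_line` is ours, and the input nets are assumed to be properly nested intervals with distinct integer terminals.

```python
def contour_on_line(nets):
    """nets: list of (a', b') pairs on L with a' < b'.

    Returns the contour as a list of (x, height) change points: from each
    listed x up to the next one, the contour has the given height.
    """
    # Rank of a net = number of nets covering it (prefix sums, minus one).
    events = sorted(
        [(a, +1, i) for i, (a, _) in enumerate(nets)]
        + [(b, -1, i) for i, (_, b) in enumerate(nets)]
    )
    rank, depth = {}, 0
    for _, sign, i in events:
        depth += sign
        if sign == +1:
            rank[i] = depth - 1

    # Bounding perimeter of a rank-k net: [a' - k, b' + k] with height k + 1.
    boxes = [(a - rank[i], b + rank[i], rank[i] + 1)
             for i, (a, b) in enumerate(nets)]
    left = min(start for start, _, _ in boxes)
    right = max(end for _, end, _ in boxes)

    profile, previous = [], 0
    for x in range(left, right):          # unit cell [x, x + 1)
        h = max((ht for start, end, ht in boxes if start <= x < end), default=0)
        if h != previous:
            profile.append((x, h))
            previous = h
    profile.append((right, 0))            # the contour closes at the far end
    return profile


if __name__ == "__main__":
    print(contour_on_line([(0, 10), (2, 4)]))
```

With a rank-0 net from 0 to 10 covering a rank-1 net from 2 to 4, the inner net contributes a perimeter from 1 to 5 of height 2, which is the raised portion of the resulting contour.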

Lemma 5.7 Given n input nets, the contours of the nets can be found in time O(n/p) on the pipelined hypercube, when n = p^(1+ε) for any ε > 0. □

We now consider the problem of determining the contours in R from the contours on L by cutting and folding L at the corners, and removing the unnecessary line segments. All the contours on L that do not cover any corner can easily be modified into the corresponding contours on R. Notice that there is at most one contour that covers each corner. We consider here the contour C that covers the lower right corner. All the other contours that cover the other corners can be treated similarly. Let the bottom side contour of C be C_b = {b_1, b_2, ..., b_p}, where the b_i's are line segments, ordered from left to right, and let the right side contour of C be C_r = {r_1, r_2, ..., r_q} ordered from top to bottom. We assume that C_r does not cover the top right corner, and the resulting contour on R is different from C_b or C_r alone, since these cases can be treated similarly.

Lemma 5.8 Let (b_i, r_j) be an intersecting pair of line segments from C_b and C_r respectively such that i is smallest. Then no segment b_k, k > i, can intersect r_s, s < j.

Proof: Notice that a horizontal segment u of height y of C came from a bounding perimeter of rank y - 1. Hence its length is at least 2y - 1 or there is a higher horizontal segment in C that extends u. Let (b_i, r_j) be an intersecting pair such that i is smallest, and (l - x, y) be the intersecting point. Assume that b_i is horizontal and r_j is vertical. The other case is symmetric. Let b_k, k > i, be a horizontal segment of the highest y-coordinate, say z, to the right of b_i. If z < y, then we are done. If z > y, then x > y, since (l - x, y) can not be on any line segment if x < y. Clearly z < x. Since x > y, b_k is to the right of r_j, and the lemma follows. □

It follows from the above lemma that the desired contour can be obtained by finding the smallest i such that b_i of C_b intersects a segment r_j of C_r. We can find such an intersection as follows. We first find h(a), the height of C_b at a, a = 1, ..., l. If b_i is horizontal and r_j is vertical, then the intersecting point (l - x, y) can be easily found by checking the height of C_b for each vertical segment of C_r. Clearly the steps can be done by using the packet routing algorithm of chapter 2.

Assume that b_i is vertical and r_j is horizontal. Let r* be a vertical segment with smallest x-coordinate, say x, among all the vertical segments of C_r below C_b. Then we can prove that the sequence of horizontal segments of C_b from x to l - x is nonincreasing in the y-coordinate. Thus, we can find the intersecting point in the same way as the previous case. After finding the intersecting point, we can easily remove the unnecessary line segments. We now state the main result of this section.

Theorem 5.2 Detailed routing of the representative nets of n nets within a rectangle can be done in time O(n/p) on the pipelined hypercube, when n = p^(1+ε) for any ε > 0. □

Corollary 5.2 Detailed routing of the representative nets of n nets within a rectangle can be done in time  on the shuffle-exchange or in time O(  on the weak hypercube, the cube-connected cycles and the butterfly, when n = p^(1+ε) for any ε > 0. □

5.4  Routability  Testing 

In this section, we address the problem of testing whether or not it is possible to route a set of nets in a given rectangle R. Notice that the detailed routing algorithm of the previous section assumes that the nets are routable, and notice also that it does not generate enough information for the routability testing since it only produces the bend points of the representative nets. However, this algorithm is important for the routability testing, as will be shown in this section. The routability testing algorithm will have the same time performance as the detailed routing algorithm.

Given a set of nets {N_i = <a_i, b_i> | 1 ≤ i ≤ n} in a rectangle R, these nets may not be routable because of one of the following reasons: (i) the graph determined by the nets in the rectangle is not planar, or (ii) the wiring of all the nets requires more area. The first case can be settled easily by sorting and computing prefix sums as in steps 3 and 4 of procedure Representative Nets. Actually, the graph is not planar if and only if there is a net N_i = <a_i, b_i> such that the prefix sum value of a_i is not equal to the prefix sum value of b_i plus 1.
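The planarity criterion can be sketched sequentially as follows. The sketch assumes that every net is given as a pair (d(a_i), d(b_i)) with d(a_i) < d(b_i), that is, no internal boundary contains the corner (0,0), and that all terminals are distinct. The name `is_planar` is ours, and the built-in sort again stands in for the sorting algorithms of chapter 2.

```python
def is_planar(nets):
    """Return True when no net violates the prefix-sum condition, i.e. when
    prefix(a_i) == prefix(b_i) + 1 holds for every net.

    nets: list of (d(a_i), d(b_i)) pairs with d(a_i) < d(b_i).
    """
    events = sorted(
        [(a, +1, i) for i, (a, _) in enumerate(nets)]
        + [(b, -1, i) for i, (_, b) in enumerate(nets)]
    )
    prefix_at_left, prefix_at_right = {}, {}
    running = 0
    for _, sign, i in events:
        running += sign                    # +1 at left, -1 at right terminals
        if sign == +1:
            prefix_at_left[i] = running
        else:
            prefix_at_right[i] = running
    return all(prefix_at_left[i] == prefix_at_right[i] + 1
               for i in range(len(nets)))


if __name__ == "__main__":
    print(is_planar([(0, 3), (1, 2)]))     # nested nets
    print(is_planar([(0, 2), (1, 3)]))     # interleaved nets
```

The two nested nets pass the test, while the interleaved pair fails it, because the left terminal of the second net raises the prefix sum between the terminals of the first net without a matching right terminal.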

Before we tackle the question of whether the rectangle is large enough to realize all the nets, we need several definitions. A side net is a net whose terminals lie on the same side of the rectangle. A side group is a group whose representative net is a side net. A corner net (group) and a cross net (group) can be defined similarly. In Figure 5.2, the side group is {N_14, N_15}, the corner group is {N_1, N_2, N_3, N_4, N_5}, and the cross group is {N_6, N_7, N_8, N_9, N_10, N_11, N_12, N_13}. The overall strategy of the routability testing is described next.

procedure  Routability  Testing 

1. Partition the input nets into groups.

2. Ignoring corner and cross groups, determine the contours of the side groups and test if the combined side groups are routable. If they are not routable then report “not routable” and stop.

3. For each corner group, test whether the nets in the group are routable within the rectangle ignoring the remaining groups. If they are, find the contour of the group, else report “not routable” and stop. Notice that there are at most four corner groups.

4. For each cross group, test whether the nets in the group are routable within the rectangle ignoring the remaining groups. If they are, find the contour of the group, else report “not routable” and stop. Notice that there are at most two cross groups.

5. Test whether any of the contours generated at steps 2, 3 and 4 intersect. The problem is routable if and only if no two contours intersect.


Notice that if the rank of each side net is less than min(l, h) - 1, the side groups on a side of R are clearly routable altogether. Thus step 2 of this algorithm can be done by finding the contours of all the side groups with procedure Contours on L, and then testing in R whether any two contours intersect using a strategy as in finding an intersecting point of contours described in section 5.3 (the paragraph following the proof of Lemma 5.7). Step 5 can also be done similarly.

We now describe the routability testing of the corner groups. The routability of the cross groups can be tested similarly. We consider the corner group that covers the corner (l, 0). The other corner groups can be treated similarly. If we remove all the corner nets from the corner group, the remaining side nets can be partitioned into a set of side subgroups. Let sb_i (sr_j) be a subgroup on the bottom (right) side of R, indexed from right to left (bottom to top). Then we define the density between sb_i and sr_j to be the number of corner nets that have to pass between the contours corresponding to the two subgroups and the capacity to be the number of corner nets that can pass within this passage. Let cn(sb_i) (cn(sr_j)) be the number of corner nets whose terminals lie between the corner (l, 0) and the leftmost terminal of the subgroup sb_i (sr_j). We can find such numbers by using the prefix sums algorithm on each side. Clearly the density between sb_i and sr_j is |cn(sb_i) - cn(sr_j)|. The following algorithm finds the capacities and tests the routability of the corner group.

procedure  Corner  Group 

1. Remove all corner nets from the group and test the routability of the
remaining side subgroups with a strategy as in step 2 of procedure Routability
Testing. If they are routable then find the contours, else report "not
routable" and stop.

2. We assume that all the line segments of the contours are sorted on each
side. Let $b_i$ ($r_j$) be the $i$-th ($j$-th) segment on the bottom (right) side from
right to left (bottom to top). Let D be the diagonal, which is the 45-degree
line in R starting from the corner $(l, 0)$. Find the projection, $p(r_j)$, of $r_j$
onto D and find $p'(r_j) = p(r_j) - \bigcup_{k<j} p(r_k)$. In Figure 5.5, $p'(CD) = CD'$
and $p'(EF) = D'F'$.

3. For each corner point of the bottom contours, find the closest point x on
D and the line segment $r_j$ of the right contours such that $p'(r_j)$ contains
x. In Figure 5.5, the line segments for corner points A and B are CD and
EF respectively.

4. For each corner point of the right contours, find such a line segment of the
bottom contours in the same way.

5. If there is a corner point such that the distance between the point and the
corresponding line segment found in steps 3 and 4 is less than the density
of the two corresponding subgroups then report "not routable", else the
corner group is routable.
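
As an illustration only, the density computation and the final comparison of the procedure can be sketched sequentially as follows. The input encoding (one list per side, ordered by increasing distance from the corner, with None marking a corner-net terminal), the function names, and the precomputed distances are assumptions of this sketch and not part of the procedure; on the network models the running counts would come from the parallel prefix sums algorithm rather than from `itertools.accumulate`.

```python
from itertools import accumulate


def corner_net_counts(side):
    """Return cn(s) for every side subgroup s on one side.

    `side` lists the terminals in order of increasing distance from the
    corner. An entry is None for a corner-net terminal, otherwise the label
    of the side subgroup owning the terminal (hypothetical encoding).
    """
    flags = [1 if label is None else 0 for label in side]
    running = list(accumulate(flags))  # inclusive prefix sums of the flags
    cn = {}
    for pos, label in enumerate(side):
        if label is not None:
            # Later positions overwrite earlier ones, so the terminal of the
            # subgroup farthest from the corner determines cn(label).
            cn[label] = running[pos]
    return cn


def density(cn_bottom, cn_right, sb, sr):
    """Density between bottom subgroup sb and right subgroup sr."""
    return abs(cn_bottom[sb] - cn_right[sr])


def corner_group_routable(candidates, cn_bottom, cn_right):
    """Final test: every distance must be at least the matching density.

    `candidates` holds (sb, sr, distance) triples, assumed to have been
    produced by the geometric steps of the procedure.
    """
    return all(dist >= density(cn_bottom, cn_right, sb, sr)
               for sb, sr, dist in candidates)


# Hypothetical example data.
bottom = [None, "b1", None, None, "b1", "b2", None, "b2"]
right = [None, "r1", "r1", None, "r2"]
cn_b = corner_net_counts(bottom)
cn_r = corner_net_counts(right)
```

With this data, `corner_group_routable([("b1", "r1", 2.0)], cn_b, cn_r)` compares the distance 2.0 against the density of the pair `("b1", "r1")`.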

Lemma 5.9 The procedure Corner Group tests the routability of the corner
group in time $O(n/p)$ on the pipelined hypercube, when $n = \Omega(p^{1+\epsilon})$ for any $\epsilon > 0$.
□

Theorem 5.3 Testing the routability of n nets within a rectangle can be done
in time $O(n/p)$ on the pipelined hypercube, when $n = \Omega(p^{1+\epsilon})$ for any $\epsilon > 0$. □


Chapter  6 
Conclusion 


6.1  Summary 

In this thesis, we showed several results related to load balancing, sorting, packet
routing, list ranking, graph theory, and VLSI routing on the pipelined hypercube,
the weak hypercube, the shuffle-exchange, the cube-connected cycles, and
the butterfly. These results included the following.

•  Development  of  provably  efficient  algorithms  on  the  above  models. 

• Establishment of lower bounds on the weak hypercube and bounded-degree
networks. All the problems considered were shown to require $\Omega(n \log p / p)$ time
on bounded-degree networks.

These  results  shed  some  light  on  the  relative  powers  of  the  pipelined  hypercube, 
the  weak  hypercube,  and  the  bounded-degree  networks. 

In Chapter 2, we showed several results concerning load balancing and sorting,
and related them to the general packet routing problem. We presented
an algorithm for load balancing whose time complexity is $O(M + \log p)$ on the
pipelined hypercube and $O(M \log p)$ on the shuffle-exchange, cube-connected
cycles, and butterfly. We also provided a lower bound for our bounded-degree
networks, and showed that load balancing required more time on the
shuffle-exchange, the cube-connected cycles, or the butterfly than on the weak
hypercube.

Many of our algorithms needed to sort integers from a small range efficiently.
We presented an $O(n/p)$ algorithm for sorting integers from a range polynomial
in the number of processors for the pipelined hypercube, whenever $n = \Omega(p^{1+\epsilon})$.
Integer sorting was shown to require $\Omega(n \log p / p)$ time on the weak hypercube and
on any bounded-degree network.

We also presented an algorithm for the general packet routing problem
whose time complexity is $O(k_1 + k_2 + n/p)$ on the pipelined hypercube, and
$O((k_1 + k_2) \log p + n \log p / p)$ on the weak hypercube and our bounded-degree
networks, whenever $n = \Omega(p^{1+\epsilon})$. The problem requires $\Omega(n \log p / p)$ time on the weak
hypercube and on any bounded-degree network. Thus the upper bounds
were tight for these networks.

In Chapter 3, we presented almost uniformly optimal algorithms to solve
several problems such as the all nearest smaller values problem (ANSV),
triangulating a monotone polygon, and line packing. We presented an algorithm for
the  ANSV  problem  whose  time  complexity  was  0{j  +  log^p)  on  the  pipelined 
hypercube  and  0{  -|-  log"*  p)  on  all  the  remaining  networks.  This  network 

algorithm was used to obtain algorithms for triangulating a monotone polygon
and line packing. We also proved a lower bound for these problems on
the weak hypercube and an $\Omega(n \log p / p)$ lower bound on our bounded-degree networks. Thus,
these algorithms were also almost uniformly optimal on our bounded-degree
networks (despite being only almost efficient).
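
For reference, the ANSV problem asks, for every element of a sequence, for the nearest strictly smaller element on its left and on its right. A minimal sequential sketch using the standard stack technique is given below. It only illustrates the problem statement and is not the network algorithm of Chapter 3; the function name and the use of None for "no smaller value" are choices made for this sketch.

```python
def all_nearest_smaller_values(a):
    """Return (left, right) index lists for the sequence `a`.

    left[i]  is the index of the nearest element to the left of i that is
             strictly smaller than a[i], or None if there is none.
    right[i] is the same for the right side.
    """
    n = len(a)
    left = [None] * n
    right = [None] * n

    stack = []  # indices whose values are strictly increasing
    for i in range(n):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        left[i] = stack[-1] if stack else None
        stack.append(i)

    stack = []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)

    return left, right


left, right = all_nearest_smaller_values([3, 1, 4, 1, 5, 9, 2, 6])
```

Each index is pushed and popped at most once per pass, so the sketch runs in linear sequential time.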

In Chapter 4, we presented a list ranking algorithm that could be executed
on the pipelined hypercube in time $O(n/p)$ when $n = \Omega(p^{1+\epsilon})$, and in
time $O(n \log n / p + \log^3 p)$ otherwise. We used these techniques to obtain fast
algorithms for several basic graph problems such as tree expression evaluation,
connected and biconnected components, ear decomposition, and st-numbering.
These problems were also addressed for the other network models. We also
proved that list ranking requires $\Omega(n \log p / p)$ time on the weak hypercube and any
bounded-degree network. Thus, our algorithm was optimal.
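
The list ranking problem itself is easy to state: given a linked list, compute for every node its distance to the end of the list. The sketch below simulates the classical pointer-jumping technique, in which every node repeatedly doubles the distance spanned by its pointer. It is included only to illustrate the problem and is not the hypercube algorithm of Chapter 4; the array representation and the function name are assumptions of this sketch, and the input is assumed to be a single acyclic list.

```python
def list_rank(succ):
    """Rank the nodes of a linked list by simulated pointer jumping.

    succ[i] is the index of the successor of node i, or None for the last
    node. Returns rank, where rank[i] is the number of links from node i
    to the last node. Each iteration of the while loop corresponds to one
    synchronous parallel round.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if s is None else 1 for s in succ]

    while any(s is not None for s in nxt):
        new_rank = rank[:]
        new_nxt = nxt[:]
        for i in range(n):  # conceptually executed by all nodes at once
            if nxt[i] is not None:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt

    return rank


# The list 0 -> 2 -> 4 -> 3 -> 1.
ranks = list_rank([2, None, 4, 1, 3])
```

The number of rounds is logarithmic in the list length, but the total work is not linear, which is one reason more careful algorithms are needed to obtain a linear speed-up.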

In Chapter 5, we presented fast algorithms for the detailed routing and the
routability testing problems within a rectangle whose time complexities were
$O(n/p)$ on the pipelined hypercube, with corresponding bounds on all the remaining networks,
when $n = \Omega(p^{1+\epsilon})$. Fast algorithms were also developed for several subproblems
that were interesting in their own right. One such subproblem was to determine
the contours of the union of sets of contours within a rectangle.

6.2  Future  Research 

ANSV is a basic problem, and a faster solution to it can be used to obtain faster
solutions for several important problems such as triangulating a monotone polygon,
parenthesis matching, and VLSI routing problems. Our solution for ANSV
depended on the block permutation algorithm, which could be performed faster
if we could find a faster algorithm to set up the data paths for an arbitrary
input permutation on a butterfly permutation network. Thus, finding a faster
algorithm for setting up the paths on our network models is an important open
problem. Finding a faster solution for ANSV with or without the block permutation
scheme on the network models is also an important open problem.

Load balancing is a fundamental problem for the network model since better
balancing over the processors results in better processor utilization. Our load
balancing algorithm is optimal on the pipelined hypercube. However, it is not
optimal on Ho and Johnsson’s more powerful hypercube model in which each
processor can set up, and send or receive $\log p$ packets in a given time step [31].
Actually,  when  M  =  H(n),  load  balancing  can  be  done  in  +  logp)  on 

this  more  powerful  model.  The  load  balancing  algorithm  is  also  not  optimal  on 
our  bounded-degree  networks  when  M  ^  \  Thus,  finding  a  general  optimal 
solution  for  load  balancing  problem  on  these  models  when  ^  <  M  <  n  is  an 
interesting  open  problem. 

Sorting integers is also a fundamental problem and we provided an algorithm
that was optimal when $n = \Omega(p^{1+\epsilon})$. That algorithm was used to solve the packet
routing problem, and thus was used to solve several other problems such as list
ranking, VLSI routing, and graph-theoretic problems. Actually, our algorithm
could  be  performed  in  0{^k^)  time,  when  n  =  p^'^i’.  Note  that  k  can  be  as  big 
as $\log p$, and the algorithm is not optimal when $n = o(p^{1+\epsilon})$. Thus, finding an
integer sorting algorithm that has a better performance when $n = o(p^{1+\epsilon})$ is
an important open problem.

Finally, the algorithm for finding connected components of a graph is not
efficient on the pipelined hypercube even when $n = \Omega(p^{1+\epsilon})$. Since a faster
solution for the problem can be directly used to find faster solutions for other
graph-theoretic problems, finding a faster connected component algorithm is
also an important open problem.




Bibliography 


[1] A. Aggarwal and M.A. Huang. Network complexity of sorting and graph
problems and simulating CRCW PRAM’s by interconnection networks.
Proc. 3rd Aegean Workshop on Computing, AWOC 88, Corfu, Greece,
June/July 1988, pp. 339-350.

[2] R.J. Anderson and G.L. Miller. Deterministic parallel list ranking. Proc.
3rd Aegean Workshop on Computing, AWOC 88, Corfu, Greece, June/July
1988, pp. 81-90.

[3] M.J. Atallah and S.E. Hambrusch. Solving tree problems on a mesh-
connected processor array. Proc. IEEE FOCS, 1985, pp. 222-231.

[4] M.J. Atallah and U. Vishkin. Finding Euler tours in parallel. Jour. of Comp.
and Sys. Sci., Vol. 29, Dec. 1984, pp. 330-337.

[5] G.H. Barnes, R.M. Brown, M. Kato, D.J. Kuck and D.L. Slotnick. The
ILLIAC IV computer. IEEE Trans. Comp., Vol. C-17, 1968, pp. 746-757.

[6] K.E. Batcher. Sorting networks and their applications. Spring Joint Computer
Conference 32, AFIPS Press, 1968, pp. 307-314.

[7] O. Berkman, B. Schieber and U. Vishkin. Some doubly logarithmic optimal
parallel algorithms based on finding all nearest smaller values. UMIACS-
TR-88-79, Univ. of Maryland, 1988.

[8] R.P. Brent. The parallel evaluation of general arithmetic expressions. JACM,
Vol. 21, 1974, pp. 201-206.

[9] S.-C. Chang and J. JaJa. Parallel algorithms for channel routing in the
knock-knee model. Submitted to SIAM Jour. on Computing. Also available
in Proc. 1988 International Conf. on Parallel Processing.

[10] S.-C. Chang, J. JaJa and K.W. Ryu. Optimal parallel algorithms for one-
layer routing. Submitted to Jour. of Algorithms. Also available as Technical
Report UMIACS-TR-89-46, CS-TR-2239, Univ. of Maryland, April 1989.




[11] F.Y. Chin, J. Lam, and I-Ngo Chen. Efficient parallel algorithms for some
graph problems. CACM, Vol. 25, 1982, pp. 659-665.

[12] R. Cole. Parallel merge sort. SIAM J. COMPUT., Vol. 17, No. 4, Aug.
1988, pp. 770-785.

[13] R. Cole and A. Siegel. River routing every which way, but loose. Proc. IEEE
FOCS, 1984, pp. 65-73.

[14] R. Cole and U. Vishkin. Deterministic coin tossing with application to optimal
parallel list ranking. Information and Control, 70, 1986, pp. 32-53.

[15] R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades:
Micro and macro techniques for designing parallel algorithms. Proc.
18th ACM Symp. on Theory of Computing, 1986, pp. 206-219.

[16] R. Cole and U. Vishkin. Faster optimal parallel prefix sums and list ranking.
TR-56/86, The Moise and Frida Eskenasy Inst. of Comp. Sci., Tel Aviv
University.

[17] R. Cole and U. Vishkin. Approximate parallel scheduling. Part 2: Applications
to logarithmic-time optimal parallel graph algorithms. To be published.

[18] R. Cypher and J.L.C. Sanz. Cubesort: An optimal sorting algorithm for
feasible parallel computers. Proc. 1988 International Conf. on Parallel Processing,
1988, pp. 308-311.

[19] E. Dekel and S. Sahni. Parallel scheduling algorithms. Operations Research,
Vol. 31, No. 1, Jan.-Feb. 1983.

[20] D. Dolev, K. Karplus, A. Siegel, A. Strong and J. Ullman. Optimal wiring
between rectangles. Proc. 13th ACM Symp. on Theory of Computing, 1981,
pp. 312-317.

[21] D. Eppstein and Z. Galil. Parallel algorithmic techniques for combinatorial
computation. Feb. 1988.

[22] A.V. Goldberg, S.A. Plotkin and G.E. Shannon. Parallel symmetry-breaking
in sparse graphs. Proc. 19th ACM Symp. on Theory of Computing, 1987,
pp. 315-324.

[23] A. Gottlieb and C.P. Kruskal. Complexity results for permuting data and
other computations on parallel processors. JACM, Vol. 31, 1984, pp. 193-209.

[24] T. Hagerup. Towards optimal parallel bucket sorting. Information and Computation
75, 1987, pp. 39-51.




[25] T. Hagerup. Optimal parallel algorithms on planar graphs. Proc. 3rd Aegean
Workshop on Computing, AWOC 88, Corfu, Greece, June/July 1988, pp.
24-32.

[26] Y. Han. Designing fast and efficient parallel algorithms. Ph.D. Dissertation,
Dept. Computer Sci., Duke Univ., 1987.

[27] Y. Han. An optimal parallel algorithm for computing linked list prefix. TR-
100-87, Dept. of Computer Sci., Univ. of Kentucky, Lexington, 1987.

[28] K.T. Herley. Efficient simulations of small shared memories. Proc. IEEE
FOCS, 1989, pp. 390-395.

[29] K.T. Herley and G. Bilardi. Deterministic simulations of PRAMs. Proc. 26th
Allerton Conf. on Communication, Control and Computing, 1988, pp. 1084-
1093.

[30] D.S. Hirschberg and A.K. Chandra. Computing connected components on
parallel computers. CACM, Vol. 22, No. 8, Aug. 1979, pp. 461-464.

[31] C.T. Ho and S.L. Johnsson. Distributed routing algorithms for broadcasting
and personalized communication in hypercubes. Proc. 1986 International
Conf. on Parallel Processing, 1986, pp. 640-648.

[32] C.T. Ho and S.L. Johnsson. Algorithms for matrix transposition on Boolean
n-cube configured ensemble architectures. Proc. 1987 International Conf. on
Parallel Processing, 1987, pp. 621-629.

[33] D. Hoey and C.E. Leiserson. A layout for the shuffle-exchange network.
Proc. 1980 International Conf. on Parallel Processing, 1980, pp. 329-336.

[34] J. JaJa and K.W. Ryu. Efficient techniques for routing and for solving
graph problems on the hypercube. UMIACS-TR-89-33, CS-TR-2216, Univ.
of Maryland, March 1989.

[35] J. JaJa and K.W. Ryu. Load balancing and routing on the hypercube and
related networks. Submitted to Jour. of Parallel and Distributed Computing.
Also available as Technical Report UMIACS-TR-89-61, CS-TR-2264,
Univ. of Maryland, June 1989.

[36] D. Johannsen. Bristle blocks: a silicon compiler. Proc. 16th Design Automation
Conference, 1979, pp. 310-313.

[37] S.L. Johnsson. Communication efficient basic linear algebra computations on
hypercube architecture. Jour. of Parallel and Distributed Computing, Vol.
4, 1987, pp. 133-172.




[38] R.M. Karp and V. Ramachandran. A survey of parallel algorithms for
shared-memory machines. Report No. UCB/CSD 88/408, Comp. Sci. Div.,
Univ. of California, Berkeley, Mar. 1988.

[39] D.E. Knuth. The art of computer programming, Vol. III: Sorting and searching.
Addison-Wesley, 1973.

[40] S.R. Kosaraju and A.L. Delcher. Optimal parallel evaluation of tree-
structured computations by raking. Proc. 3rd Aegean Workshop on Computing,
AWOC 88, Corfu, Greece, June/July 1988, pp. 101-110.

[41] M. Kramer and J. van Leeuwen. Wire routing is NP-complete. TR, University
of Utrecht, the Netherlands, Feb. 1982.

[42] C.P. Kruskal, T. Madej and L. Rudolph. Parallel prefix on fully connected
direct connection machines. Proc. 1986 International Conf. on Parallel Processing,
pp. 278-284.

[43] C.P. Kruskal, L. Rudolph and M. Snir. A complexity theory of efficient
parallel algorithms. To appear in Theoretical Computer Science.

[44] C.P. Kruskal, L. Rudolph and M. Snir. The power of parallel prefix. IEEE
Trans. Comp., Vol. C-34, No. 10, Oct. 1985, pp. 965-968.

[45] H.T. Kung. The structure of parallel algorithms. Advances in Computers,
Vol. 19, Academic Press, Inc., 1980, pp. 65-112.

[46] A. LaPaugh. Algorithms for integrated circuit layout: an analytic approach.
Ph.D. dissertation, MIT, Cambridge, MA, Nov. 1980.

[47] T. Leighton. Tight bounds on the complexity of parallel sorting. IEEE
Trans. Comp., Vol. C-34, No. 4, April 1985, pp. 344-354.

[48] C.E. Leiserson and F.M. Maley. Algorithms for routing and testing routability
of planar VLSI layouts. Proc. 17th ACM Symp. on Theory of Computing,
1985, pp. 69-78.

[49] C. Leiserson and R. Pinter. Optimal placement for river routing. SIAM J.
COMPUT., Vol. 12, No. 3, Aug. 1983, pp. 447-462.

[50] G.F. Lev, N. Pippenger and L.G. Valiant. A fast parallel algorithm for
routing in permutation networks. IEEE Trans. Comp., Vol. C-30, No. 2,
Feb. 1981, pp. 93-100.

[51] F.M. Maley. Toward a mathematical theory of single-layer wire routing. 5th
MIT Conference on Advanced Research in VLSI, March 1988, pp. 277-296.




[52] Y. Maon, B. Schieber and U. Vishkin. Parallel ear decomposition search
(EDS) and st-numbering in graphs. Theoretical Comp. Sci. 47, 1986, pp.
277-298.

[53] A. Mirzaian. Channel routing in VLSI. Proc. 16th ACM Symp. on Theory
of Computing, 1984, pp. 101-107.

[54] A. Moitra and S.S. Iyengar. Parallel algorithms for some computational
problems. Advances in Computers, Vol. 26, Academic Press, Inc., 1987,
pp. 93-153.

[55] D. Nassimi and S. Sahni. Bitonic sort on a mesh-connected parallel computer.
IEEE Trans. on Comp., Vol. C-27, No. 1, Jan. 1979, pp. 2-7.

[56] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE
Trans. on Comp., Vol. C-30, No. 2, Feb. 1981, pp. 101-107.

[57] D. Nassimi and S. Sahni. Parallel algorithms to set up the Benes permutation
network. IEEE Trans. on Comp., Vol. C-31, No. 2, Feb. 1982, pp. 148-154.

[58] D. Peleg and E. Upfal. The generalized packet routing problem. Theoretical
Comp. Sci. 53, 1987, pp. 281-293.

[59] R. Pinter. River routing: methodology and analysis. Proc. 3rd CALTECH
Conference on Very Large Scale Integration, 1983, pp. 141-163.

[60] N.J. Pippenger. On simultaneous resource bounds. Proc. IEEE FOCS, 1979,
pp. 307-311.

[61] C.G. Plaxton. Load balancing, selection and sorting on the hypercube. Proc.
1989 ACM Symp. on Parallel Algorithms and Architectures, pp. 64-73.

[62] F.P. Preparata and J. Vuillemin. The cube-connected cycles: A versatile
network for parallel computation. CACM 24, 1981, pp. 300-309.

[63] K.W. Ryu and J. JaJa. List ranking on the hypercube. Proc. 1989 International
Conf. on Parallel Processing, Vol. III, pp. 20-23.

[64] K.W. Ryu and J. JaJa. Efficient algorithms for list ranking and for solving
graph problems on the hypercube. IEEE Trans. Parallel and Distributed
Systems, Vol. 1, No. 1, Jan. 1990, pp. 83-90.

[65] Y. Saad and M.H. Schultz. Data communication in hypercubes. TR
YALEU/DCS/RR-428, Yale University, Oct. 1985.

[66] Y. Saad and M.H. Schultz. Topological properties of hypercubes. IEEE
Trans. Comp., Vol. C-37, No. 7, July 1988, pp. 867-872.




[67] S. Sahni and A. Bhatt. Complexity of the Design Automation Problem.
Proc. 17th Design Automation Conference, June 1980, pp. 402-411.

[68] B. Schieber and U. Vishkin. On finding lowest common ancestors: simplification
and parallelization. SIAM J. COMPUT., Vol. 17, No. 6, Dec. 1988,
pp. 1253-1262.

[69] B. Schieber and U. Vishkin. Finding all nearest neighbors for convex polygons
in parallel: A new lower bound technique and a matching algorithm.
UMIACS-TR-88-82, CS-TR-2138, Univ. of Maryland, Nov. 1988.

[70] E. Schwabe. Normal hypercube algorithms can be simulated on a butterfly
with only constant slowdown. Manuscript, 1989.

[71] J.T. Schwartz. Ultracomputers. ACM Trans. Prog. Languages and Systems,
Vol. 2, No. 4, Oct. 1980, pp. 484-521.

[72] A. Siegel and D. Dolev. Some Geometry for General River Routing. SIAM
J. COMPUT., Vol. 17, No. 3, June 1988, pp. 583-605.

[73] C.L. Seitz. The cosmic cube. CACM 28, 1985, pp. 22-33.

[74] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm.
Jour. of Algorithms 3, 1982, pp. 57-67.

[75] H.S. Stone. Parallel Processing with The Perfect Shuffle. IEEE Trans. on
Comp., Vol. C-20, No. 2, Feb. 1971, pp. 153-161.

[76] T. Szymanski. Dogleg Channel routing is NP-complete. Manuscript, Bell
Laboratories, Murray Hill, NJ, Sep. 1981.

[77] R.E. Tarjan and U. Vishkin. An efficient parallel biconnectivity algorithm.
SIAM J. COMPUT., Vol. 14, No. 4, Nov. 1985, pp. 862-874.

[78] C.D. Thompson and H.T. Kung. Sorting on a mesh-connected parallel computer.
CACM, Vol. 20, No. 4, Apr. 1977, pp. 263-271.

[79] M. Tompa. An optimal solution to a wire routing problem. Proc. 12th ACM
Symp. on Theory of Computing, 1980, pp. 161-176.

[80] Y.H. Tsin and F.Y. Chin. Efficient parallel algorithms for a class of graph
theoretic problems. SIAM J. COMPUT., Vol. 13, No. 3, Aug. 1984, pp.
580-599.

[81] J.D. Ullman. Computational aspects of VLSI. Computer Science Press, Inc.,
1984.




[82] P. Varman and K. Doshi. Sorting with linear speedup in a VLSI network.
Proc. 1988 International Conf. on Parallel Processing, 1988, pp. 202-206.

[83]  U.  Vishkin.  Synchronous  parallel  computation  -  A  survey.  TR-^71, 
Courant  Institute,  New  York  University,  Apr.  1983. 

[84] U. Vishkin. An optimal parallel connectivity algorithm. Disc. App. Math. 9,
1984, pp. 197-207.

[85] U. Vishkin. On efficient parallel strong orientation. Infor. Proc. Letters 20,
June 1985, pp. 235-240.

[86] A. Waksman. A permutation network. JACM, Vol. 15, No. 1, Jan. 1968,
pp. 159-163.

[87] J.C. Wyllie. The complexity of parallel computation. TR 79-387, Dept. of
Computer Sci., Cornell Univ., Ithaca, NY, 1979.




Curriculum  Vitae 


Name:                              Kwan Woo Ryu

Permanent address:                 228-4, Beomo-dong, Sooseong-ku,
                                   Daegu, Korea

Degree and date to be conferred:   Ph.D., 1990

Date of birth:                     November 2, 1957

Place of birth:                    Andong, Korea

Secondary education:               Kyungpook High School, Daegu, Korea


Collegiate institutions attended:

Institution                        Dates Attended   Degree   Date of Degree

University of Maryland             8/85             Ph.D.    5/90
College Park, MD 20742
U.S.A.

Korea Advanced Institute           3/80             M.S.     2/82
of Science and Technology
Seoul, Korea

Kyungpook Natl. Univ.              3/76             B.S.     2/80
Daegu, Korea

Major:                             Computer Science


Professional  publications: 

[1] Efficient Algorithms for List Ranking and for Solving Graph Problems
on the Hypercube (with J. JaJa), IEEE Trans. Parallel and Distributed
Systems, Vol. 1, No. 1, Jan. 1990, pp. 83-90.

[2] Optimal Parallel Algorithms for One-Layer Routing (with S.-C. Chang and
J. JaJa), submitted to Journal of Algorithms. Also available as Technical
Report UMIACS-TR-89-46, CS-TR-2239, Univ. of Maryland, April 1989.

[3] Load Balancing and Routing on the Hypercube and Related Networks
(with J. JaJa), submitted to Journal of Parallel and Distributed Computing.
Also available as Technical Report UMIACS-TR-89-61, CS-TR-2264,
Univ. of Maryland, June 1989.

[4] List Ranking on the Hypercube (with J. JaJa), Proc. of 1989 International
Conference on Parallel Processing, Vol. III, pp. 20-23.

[5] Almost Uniformly Efficient Parallel Algorithms for Several Problems on
the Network Model (with J. JaJa), submitted to 1990 Symp. on Parallel
Algorithms and Architectures.

[6] Efficient Techniques for Routing and for Solving Graph Problems on the
Hypercube (with J. JaJa), UMIACS-TR-89-33, CS-TR-2216, Univ. of
Maryland, March 1989.


Professional  positions  held: 

6/90 -         Assistant Professor
               Kyungpook Natl. Univ.
               Daegu, Korea.

7/89 - 6/90    Graduate Research Fellow
               Institute for Advanced Computer Studies,
               University of Maryland
               College Park, MD 20742, U.S.A.

8/85 - 6/89    Graduate Assistant
               Dept. of Computer Science,
               Systems Research Center,
               Institute for Advanced Computer Studies,
               University of Maryland
               College Park, MD 20742, U.S.A.

3/82 - 7/85    Full Time Lecturer
               Dept. of Electrical Engineering
               Kyungpook Natl. Univ., Daegu, Korea.

3/80 - 2/82    Research Assistant
               Department of Computer Science
               Korea Advanced Institute of Science and Technology
               Seoul, Korea


